Algorithms

The hana_ml.algorithms.pal package consists of many algorithms. Grouped by category, the supported PAL algorithms are listed as follows.
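All of the algorithms below operate on hana_ml DataFrames obtained through a ConnectionContext. As a minimal setup sketch (the host, port, credentials, and table name MY_TABLE are placeholders, not real values):

    from hana_ml.dataframe import ConnectionContext

    # Placeholder connection details; replace with those of your SAP HANA instance.
    cc = ConnectionContext(address='hana.example.com', port=39015,
                           user='ML_USER', password='***')

    # A hana_ml DataFrame is a reference to data in SAP HANA; nothing is
    # transferred to the client until collect() is called.
    df = cc.table('MY_TABLE')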

PAL Base

pal_base.PALBase([conn_context])

Subclass for PAL-specific functionality.

Auto ML

auto_ml.AutomaticClassification([scorings, ...])

AutomaticClassification offers an intelligent search amongst machine learning pipelines for supervised classification tasks.

auto_ml.AutomaticRegression([scorings, ...])

AutomaticRegression offers an intelligent search amongst machine learning pipelines for supervised regression tasks.

auto_ml.AutomaticTimeSeries([scorings, ...])

AutomaticTimeSeries offers an intelligent search amongst machine learning pipelines for time series tasks.

auto_ml.Preprocessing(name, **kwargs)

Preprocessing class.
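As a brief, hedged sketch of the AutoML interface (df_train and df_test are assumed hana_ml DataFrames with key column ID and label column CLASS; the tiny search budget is for illustration only):

    from hana_ml.algorithms.pal.auto_ml import AutomaticClassification

    # A deliberately small pipeline search; increase generations and
    # population_size for real workloads.
    auto_c = AutomaticClassification(generations=2,
                                     population_size=5,
                                     offspring_size=5)
    auto_c.fit(data=df_train, key='ID', label='CLASS')
    predictions = auto_c.predict(data=df_test, key='ID')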

Unified Interface

unified_classification.UnifiedClassification(func)

The Python wrapper for the SAP HANA PAL unified-classification function.

unified_regression.UnifiedRegression(func[, ...])

The Python wrapper for the SAP HANA PAL unified-regression function.

unified_clustering.UnifiedClustering(func[, ...])

The Python wrapper for the SAP HANA PAL Unified Clustering function.

unified_exponentialsmoothing.UnifiedExponentialSmoothing(func)

The Python wrapper for the SAP HANA PAL Unified Exponential Smoothing function.
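For example, a minimal sketch of the unified classification interface, assuming df_train and df_test are hana_ml DataFrames keyed by ID with label column CLASS:

    from hana_ml.algorithms.pal.unified_classification import UnifiedClassification

    # func selects the underlying PAL classifier, here a random decision tree.
    uc = UnifiedClassification(func='RandomDecisionTree')
    uc.fit(data=df_train, key='ID', label='CLASS')
    predictions = uc.predict(data=df_test, key='ID')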

Clustering

clustering.AffinityPropagation(affinity, ...)

Affinity Propagation is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars.

clustering.AgglomerateHierarchicalClustering([...])

A widely used hierarchical clustering method that finds natural groups within a set of data.

clustering.DBSCAN([minpts, eps, ...])

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

clustering.GeometryDBSCAN([minpts, eps, ...])

This function is a geometry version of DBSCAN, which only accepts geometry points as input data.

clustering.KMeans([n_clusters, ...])

K-Means model that handles clustering problems.

clustering.KMedians(n_clusters[, init, ...])

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center.

clustering.KMedoids(n_clusters[, init, ...])

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center.

clustering.SpectralClustering(n_clusters[, ...])

This is the Python wrapper for PAL Spectral Clustering.

clustering.KMeansOutlier([n_clusters, ...])

Outlier detection of datasets using k-means clustering.

mixture.GaussianMixture(init_param[, ...])

Representation of a Gaussian mixture model probability distribution.

som.SOM([covergence_criterion, ...])

Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis.

clustering.SlightSilhouette(data[, ...])

Silhouette is a method used to validate the clustering of data.

clustering.outlier_detection_kmeans(data[, ...])

Outlier detection based on k-means clustering.
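A typical clustering call, sketched under the assumption that df is a hana_ml DataFrame with key column ID and numeric feature columns:

    from hana_ml.algorithms.pal.clustering import KMeans

    # fit_predict trains the model and returns each row's cluster
    # assignment as a hana_ml DataFrame.
    km = KMeans(n_clusters=3, init='first_k', max_iter=100)
    assignments = km.fit_predict(data=df, key='ID')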

Classification

discriminant_analysis.LinearDiscriminantAnalysis([...])

Linear discriminant analysis for classification and data reduction.

linear_model.LogisticRegression([...])

Logistic regression model that handles binary-class and multi-class classification problems.

linear_model.OnlineMultiLogisticRegression(...)

The online version of multi-class logistic regression; the standard multi-class logistic regression is the offline/batch version.

naive_bayes.NaiveBayes([alpha, ...])

A classification model based on Bayes' theorem.

neighbors.KNNClassifier([n_neighbors, ...])

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase.

neural_network.MLPClassifier([activation, ...])

Multi-layer perceptron (MLP) Classifier.

svm.SVC([c, kernel, degree, gamma, ...])

Support Vector Classification.

svm.OneClassSVM([c, kernel, degree, gamma, ...])

One-class SVM.

trees.DecisionTreeClassifier([algorithm, ...])

Decision Tree model for classification.

trees.RDTClassifier([n_estimators, ...])

Random Decision Tree model for classification.

trees.HybridGradientBoostingClassifier([...])

Hybrid Gradient Boosting trees model for classification.
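The classifiers above share a common fit/predict pattern; a hedged sketch with Hybrid Gradient Boosting follows (df_train and df_test with key ID and label CLASS are assumptions, and the hyperparameter values are illustrative only):

    from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier

    hgbc = HybridGradientBoostingClassifier(n_estimators=20,
                                            max_depth=5,
                                            learning_rate=0.3)
    hgbc.fit(data=df_train, key='ID', label='CLASS')
    predictions = hgbc.predict(data=df_test, key='ID')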

Regression

linear_model.LinearRegression([solver, ...])

Linear regression models the linear relationship between a variable, usually referred to as the dependent variable, and one or more variables, usually referred to as independent variables (the predictor vector).

linear_model.OnlineLinearRegression([...])

Online linear regression (stateless) is an online version of linear regression, used when the training data are obtained in multiple rounds.

neighbors.KNNRegressor([n_neighbors, ...])

K-Nearest Neighbor (KNN) is a memory-based classification or regression method with no explicit training phase.

neural_network.MLPRegressor([activation, ...])

Multi-layer perceptron (MLP) Regressor.

regression.PolynomialRegression([degree, ...])

Polynomial regression is an approach to model the relationship between a scalar variable y and a variable denoted X.

regression.GLM([family, link, solver, ...])

Regression by a generalized linear model, based on PAL_GLM.

regression.ExponentialRegression([...])

Exponential regression is an approach to modeling the relationship between a scalar variable y and one or more variables denoted X.

regression.BiVariateGeometricRegression([...])

Geometric regression is an approach used to model the relationship between a scalar variable y and a variable denoted X.

regression.BiVariateNaturalLogarithmicRegression([...])

Bi-variate natural logarithmic regression is an approach to modeling the relationship between a scalar variable y and one variable denoted X.

regression.CoxProportionalHazardModel([...])

Cox proportional hazard model (CoxPHM) is a special generalized linear model.

svm.SVR([c, kernel, degree, gamma, ...])

Support Vector Regression.

trees.DecisionTreeRegressor([algorithm, ...])

Decision Tree model for regression.

trees.RDTRegressor([n_estimators, ...])

Random Decision Tree model for regression.

trees.HybridGradientBoostingRegressor([...])

Hybrid Gradient Boosting model for regression.
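The regressors follow the same fit/predict pattern; a minimal sketch with linear regression, assuming df_train and df_test are hana_ml DataFrames with key column ID and numeric target column Y:

    from hana_ml.algorithms.pal.linear_model import LinearRegression

    lr = LinearRegression()
    lr.fit(data=df_train, key='ID', label='Y')
    predictions = lr.predict(data=df_test, key='ID')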

Preprocessing

preprocessing.FeatureNormalizer([method, ...])

Normalize a DataFrame.

preprocessing.FeatureSelection(fs_method[, ...])

Feature selection (FS) is a dimensionality reduction technique that selects a subset of relevant features for model construction, reducing memory storage and improving computational efficiency while avoiding significant loss of information.

preprocessing.IsolationForest([...])

Isolation Forest generates an anomaly score for each sample.

preprocessing.KBinsDiscretizer(strategy, ...)

Bin continuous data into a number of intervals and perform local smoothing.

preprocessing.Imputer([strategy, ...])

Missing value imputation for DataFrame.

preprocessing.Discretize(strategy[, n_bins, ...])

An enhanced version of the binning function that can be applied to tables with multiple columns.

preprocessing.MDS(matrix_type[, ...])

This class serves as a tool for dimensionality reduction or data visualization.

preprocessing.SMOTE([smote_amount, ...])

This class handles imbalanced datasets using the Synthetic Minority Over-sampling Technique (SMOTE).

preprocessing.SMOTETomek([smote_amount, ...])

This class combines over-sampling using SMOTE and cleaning (under-sampling) using Tomek links.

preprocessing.TomekLinks([distance_level, ...])

This class performs under-sampling by removing Tomek's links.

preprocessing.Sampling(method[, interval, ...])

This class is used to choose a small portion of the records as representatives.

preprocessing.ImputeTS([imputation_type, ...])

Imputation of multi-dimensional time-series data.

preprocessing.PowerTransform([method, ...])

This class implements a Python interface for the power transform algorithm in PAL.

decomposition.PCA([scaling, thread_ratio, ...])

Principal component analysis (PCA) reduces the dimensionality of multivariate data using singular value decomposition.

decomposition.CATPCA([scaling, ...])

Principal components analysis algorithm that supports categorical features.

partition.train_test_val_split(data[, ...])

The algorithm randomly partitions an input dataset into three disjoint subsets: training, testing, and validation.

preprocessing.variance_test(data, sigma_num)

Variance Test is a method to identify outliers among n numeric data points {x_i}, 1 ≤ i ≤ n, using the mean and the standard deviation of the data.
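A hedged preprocessing sketch with FeatureNormalizer, assuming df_train and df_test are hana_ml DataFrames keyed by ID; statistics fitted on the training data are reused to transform new data:

    from hana_ml.algorithms.pal.preprocessing import FeatureNormalizer

    # Scale numeric features into [0, 1] using min-max normalization.
    fn = FeatureNormalizer(method='min-max', new_min=0.0, new_max=1.0)
    fn.fit(data=df_train, key='ID')
    scaled_train = fn.result_                          # normalized training data
    scaled_test = fn.transform(data=df_test, key='ID')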

Time Series

tsa.additive_model_forecast.AdditiveModelForecast([...])

Additive Model Forecast uses a decomposable time series model with three main components: trend, seasonality, and holidays or events.

tsa.arima.ARIMA([order, seasonal_order, ...])

Autoregressive Integrated Moving Average ARIMA(p, d, q) model.

tsa.auto_arima.AutoARIMA([seasonal_period, ...])

Although the ARIMA model is useful and powerful in time series analysis, choosing appropriate orders can be difficult; AutoARIMA determines the orders automatically.

tsa.changepoint.CPD([cost, penalty, solver, ...])

Change-point detection (CPDetection) methods aim at detecting multiple abrupt changes, such as changes in mean, variance, or distribution, in observed time-series data.

tsa.changepoint.BCPD(max_tcp, max_scp[, ...])

Bayesian Change-point detection (BCPD) detects abrupt changes in the time series.

tsa.changepoint.OnlineBCPD([alpha, beta, ...])

Online Bayesian Change-point detection.

tsa.bsts.BSTS([burn, niter, ...])

Class for Bayesian structural time series (BSTS).

tsa.classification.TimeSeriesClassification([...])

Time series classification.

tsa.exponential_smoothing.SingleExponentialSmoothing([...])

Single exponential smoothing is suitable to model the time series without trend and seasonality.

tsa.exponential_smoothing.DoubleExponentialSmoothing([...])

Double exponential smoothing is suitable to model the time series with trend but without seasonality.

tsa.exponential_smoothing.TripleExponentialSmoothing([...])

Triple exponential smoothing is used to handle the time series data containing a seasonal component.

tsa.exponential_smoothing.AutoExponentialSmoothing([...])

Auto exponential smoothing (previously named forecast smoothing) is used to calculate optimal parameters of a set of smoothing functions in SAP HANA PAL, including Single Exponential Smoothing, Double Exponential Smoothing, and Triple Exponential Smoothing.

tsa.exponential_smoothing.BrownExponentialSmoothing([...])

Brown exponential smoothing is suitable to model the time series with trend but without seasonality.

tsa.exponential_smoothing.Croston([alpha, ...])

The Croston method is a forecast strategy for products with intermittent demand.

tsa.exponential_smoothing.CrostonTSB([...])

The Croston TSB method (named for Teunter, Syntetos & Babai) is a forecast strategy for products with intermittent demand.

tsa.garch.GARCH([p, q, model_type])

Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) is a statistical model used to analyze the variance of the error (innovation or residual) term in time series.

tsa.hierarchical_forecast.Hierarchical_Forecast([...])

The hierarchical forecast algorithm forecasts across a hierarchy, that is, it ensures that the forecasts sum appropriately across the levels.

tsa.lr_seasonal_adjust.LR_seasonal_adjust([...])

Linear regression with damped trend and seasonal adjustment is an approach for forecasting when a time series presents a trend.

tsa.lstm.LSTM([learning_rate, gru, ...])

Long short-term memory (LSTM).

tsa.ltsf.LTSF([batch_size, num_epochs, ...])

Long-Term Series Forecasting (LTSF).

tsa.online_algorithms.OnlineARIMA([order, ...])

Online Autoregressive Integrated Moving Average ARIMA(p, d, q) model.

tsa.outlier_detection.OutlierDetectionTS([...])

Outlier detection for time-series.

tsa.rnn.GRUAttention([learning_rate, ...])

Gated Recurrent Unit (GRU) based encoder-decoder model with an attention mechanism for time series prediction.

tsa.rocket.ROCKET([method, num_features, ...])

RandOm Convolutional KErnel Transform (ROCKET) is an exceptionally efficient algorithm for time series classification.

tsa.vector_arima.VectorARIMA([order, ...])

Vector Autoregressive Integrated Moving Average ARIMA(p, d, q) model.

tsa.wavelet.DWT(wavelet[, boundary, level, ...])

A class designed for discrete wavelet transform and wavelet packet transform.

tsa.accuracy_measure.accuracy_measure(data)

Accuracy measures are used to check the accuracy of forecasts made by PAL algorithms.

tsa.correlation_function.correlation(data[, ...])

This correlation function gives the statistical correlation between random variables.

tsa.fft.fft(data[, num_type, inverse, ...])

Apply Fast-Fourier-Transform to the input data, and return the transformed data.

tsa.dtw.dtw(query_data, ref_data[, radius, ...])

DTW (Dynamic Time Warping) measures the similarity between two time series that may vary in speed.

tsa.fast_dtw.fast_dtw(data, radius[, ...])

A fast, approximate version of dynamic time warping (DTW).

tsa.intermittent_forecast.intermittent_forecast(data)

This function is a wrapper for PAL Intermittent Time Series Forecast (ITSF), which is a new forecast strategy for products with intermittent demand.

tsa.periodogram.periodogram(data[, key, ...])

Periodogram is an estimate of the spectral density of a signal.

tsa.stationarity_test.stationarity_test(data)

Stationarity means that a time series has a constant mean and constant variance over time; this function tests whether a given series is stationary.

tsa.seasonal_decompose.seasonal_decompose(data)

The seasonal_decompose function decomposes a time series into three components: trend, seasonality, and random noise.

tsa.trend_test.trend_test(data[, key, ...])

Trend test identifies whether a time series has an upward or downward trend, and calculates the de-trended time series.

tsa.wavelet.wavedec(data, wavelet[, key, ...])

Python wrapper for PAL multi-level discrete wavelet transform.

tsa.wavelet.waverec(dwt[, wavelet, boundary])

Python wrapper for PAL multi-level inverse discrete wavelet transform.

tsa.wavelet.wpdec(data, wavelet[, key, col, ...])

Python wrapper for PAL multi-level (discrete) wavelet packet transformation.

tsa.wavelet.wprec(dwt[, wavelet, boundary])

Python wrapper for PAL multi-level inverse (discrete) wavelet packet transformation.

tsa.white_noise_test.white_noise_test(data)

This algorithm is used to identify whether a time series is a white noise series.
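As a hedged time-series sketch with ARIMA, assuming df_ts is a hana_ml DataFrame with an integer index column ID and a value column SERIES (the orders are illustrative, not recommendations):

    from hana_ml.algorithms.pal.tsa.arima import ARIMA

    arima = ARIMA(order=(1, 0, 1))
    arima.fit(data=df_ts, key='ID', endog='SERIES')
    forecast = arima.predict(forecast_length=12)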

Statistics

random.bernoulli(conn_context[, p, ...])

Draw samples from a Bernoulli distribution.

random.beta(conn_context[, a, b, ...])

Draw samples from a Beta distribution.

random.binomial(conn_context[, n, p, ...])

Draw samples from a binomial distribution.

random.cauchy(conn_context[, location, ...])

Draw samples from a Cauchy distribution.

random.chi_squared(conn_context[, dof, ...])

Draw samples from a chi-squared distribution.

random.exponential(conn_context[, lamb, ...])

Draw samples from an exponential distribution.

random.gumbel(conn_context[, location, ...])

Draw samples from a Gumbel distribution, which is one of a class of Generalized Extreme Value (GEV) distributions used in modeling extreme value problems.

random.f(conn_context[, dof1, dof2, ...])

Draw samples from an F distribution.

random.gamma(conn_context[, shape, scale, ...])

Draw samples from a gamma distribution.

random.geometric(conn_context[, p, ...])

Draw samples from a geometric distribution.

random.lognormal(conn_context[, mean, ...])

Draw samples from a lognormal distribution.

random.negative_binomial(conn_context[, n, ...])

Draw samples from a negative binomial distribution.

random.normal(conn_context[, mean, sigma, ...])

Draw samples from a normal distribution.

random.pert(conn_context[, minimum, mode, ...])

Draw samples from a PERT distribution.

random.poisson(conn_context[, theta, ...])

Draw samples from a Poisson distribution.

random.student_t(conn_context[, dof, ...])

Draw samples from a Student's t-distribution.

random.uniform(conn_context[, low, high, ...])

Draw samples from a uniform distribution.

random.weibull(conn_context[, shape, scale, ...])

Draw samples from a Weibull distribution.

random.multinomial(conn_context, n, pvals[, ...])

Draw samples from a multinomial distribution.

random.mcmc(conn_context, distribution[, ...])

Given a distribution, this function generates samples of the distribution using Markov chain Monte Carlo simulation.

stats.chi_squared_goodness_of_fit(data, key)

Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.

stats.chi_squared_independence(data, key[, ...])

Perform the chi-squared test of independence to tell whether observations of two variables are independent of each other.

stats.ttest_1samp(data[, col, mu, ...])

Perform the t-test to determine whether a sample of observations could have been generated by a process with a specific mean.

stats.ttest_ind(data[, col1, col2, mu, ...])

Perform the T-test for the mean difference of two independent samples.

stats.ttest_paired(data[, col1, col2, mu, ...])

Perform the t-test for the mean difference of two sets of paired samples.

stats.f_oneway(data[, group, sample, ...])

Performs a one-way ANOVA.

stats.f_oneway_repeated(data, subject_id[, ...])

Performs one-way repeated measures analysis of variance, along with Mauchly's Test of Sphericity and post hoc multiple comparison tests.

stats.univariate_analysis(data[, key, cols, ...])

Provides an overview of the dataset.

stats.covariance_matrix(data[, cols])

Computes the covariance matrix.

stats.pearsonr_matrix(data[, cols])

Computes a correlation matrix using Pearson's correlation coefficient.

stats.iqr(data, key[, col, multiplier])

Perform the inter-quartile range (IQR) test to find the outliers of the data.

stats.wilcoxon(data[, col, mu, test_type, ...])

Perform a one-sample or paired two-sample non-parametric test to check whether the median of the data is different from a specific value.

stats.median_test_1samp(data[, col, mu, ...])

Perform a one-sample non-parametric test to check whether the median of the data is different from a user-specified value.

stats.grubbs_test(data, key[, col, method, ...])

Perform Grubbs' test to detect outliers in a given univariate data set.

stats.entropy(data[, col, ...])

This function is used to calculate the information entropy of attributes.

stats.condition_index(data[, key, col, ...])

Condition index is used to detect collinearity problems between independent variables that are later used as predictors in a multiple linear regression model.

stats.cdf(data, distr_info[, col, complementary])

This algorithm evaluates the probability of a variable x from the cumulative distribution function (CDF) or complementary cumulative distribution function (CCDF) for a given probability distribution.

stats.ftest_equal_var(data_x, data_y[, ...])

This function tests the equality of two random variances using the F-test.

stats.factor_analysis(data, key, factor_num)

Factor analysis is a statistical method that tries to extract a low number of unobserved variables, i.e. factors, that can best describe the covariance pattern of a larger set of observed variables.

stats.kaplan_meier_survival_analysis(data[, ...])

The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function from lifetime data.

stats.quantile(data, distr_info[, col, ...])

This algorithm evaluates the inverse of the cumulative distribution function (CDF) or the inverse of the complementary cumulative distribution function (CCDF) for a given probability p and probability distribution.

stats.distribution_fit(data, distr_type[, ...])

This algorithm fits a probability distribution to a variable according to a series of measurements of the variable.

stats.ks_test(data[, distribution_name, ...])

This function performs one-sample or two-sample Kolmogorov-Smirnov test for goodness of fit.

stats.interval_quality(data, significance_level)

The function provides a method to evaluate the quality of interval forecasts.

kernel_density.KDE([thread_ratio, ...])

Performs kernel density estimation, an analogue of the histogram that avoids its defects.
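A short, hedged sketch of the statistics functions, assuming cc is a ConnectionContext and df is a hana_ml DataFrame with a numeric column X:

    from hana_ml.algorithms.pal import stats
    from hana_ml.algorithms.pal.random import normal

    # Draw 100 samples from N(0, 1) inside SAP HANA; the result is a
    # hana_ml DataFrame of generated numbers.
    samples = normal(conn_context=cc, num_random=100, mean=0, sigma=1)

    # One-sample t-test of column X against mu = 0; returns a DataFrame
    # of test statistics.
    result = stats.ttest_1samp(data=df, col='X', mu=0)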

Association

association.Apriori(min_support, min_confidence)

Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis.

association.AprioriLite(min_support, ...[, ...])

A light version of the Apriori algorithm for association rule mining, which calculates only two large item sets.

association.FPGrowth([min_support, ...])

FP-Growth is an algorithm to find frequent patterns from transactions without generating a candidate itemset.

association.KORD([k, measure, min_support, ...])

K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.

association.SPM(min_support[, relational, ...])

The sequential pattern mining algorithm searches for frequent patterns in sequence databases.
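A minimal association-rule sketch with Apriori; df_trans is an assumed hana_ml DataFrame with a transaction-ID column CUSTOMER and an item column ITEM, and the thresholds are illustrative:

    from hana_ml.algorithms.pal.association import Apriori

    ap = Apriori(min_support=0.1, min_confidence=0.3)
    ap.fit(data=df_trans, transaction='CUSTOMER', item='ITEM')
    rules = ap.result_   # mined association rules as a hana_ml DataFrame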

Recommender System

recommender.ALS([random_state, max_iter, ...])

Alternating least squares (ALS) is a powerful matrix factorization algorithm for building both explicit and implicit feedback based recommender systems.

recommender.FRM([solver, factor_num, init, ...])

Factorized Polynomial Regression Models or Factorization Machines approach.

recommender.FFMClassifier([ordering, ...])

Field-Aware Factorization Machine with the task of classification.

recommender.FFMRegressor([ordering, ...])

Field-Aware Factorization Machine with the task of Regression.

recommender.FFMRanker([ordering, normalise, ...])

Field-Aware Factorization Machine with the task of ranking.
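A hedged recommender sketch with ALS; df_ratings is an assumed hana_ml DataFrame of user/item/rating triples, and the column names and hyperparameters are illustrative:

    from hana_ml.algorithms.pal.recommender import ALS

    als = ALS(factor_num=2, lamb=1e-2, max_iter=20, random_state=1)
    als.fit(data=df_ratings, usr='USER', item='MOVIE', rating='RATING')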

Social Network Analysis

linkpred.LinkPrediction(method[, beta, ...])

Link predictor for calculating, in a network, proximity scores between nodes that are not directly linked; this is helpful for predicting missing links (the higher the proximity score, the more likely the two nodes are to be linked).

pagerank.PageRank([damping, max_iter, tol, ...])

A page rank model.
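A brief sketch, under the assumption that df_edges is a hana_ml DataFrame of directed links (a from-node column followed by a to-node column):

    from hana_ml.algorithms.pal.pagerank import PageRank

    # Damping factor 0.85 is the conventional default.
    pr = PageRank(damping=0.85, max_iter=100)
    ranks = pr.run(data=df_edges)   # node-to-rank table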

Ranking

svm.SVRanking([c, kernel, degree, gamma, ...])

Support Vector Ranking.
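A hedged ranking sketch; df_train is an assumed hana_ml DataFrame whose rows are grouped into queries by a qid column (all column names here are assumptions):

    from hana_ml.algorithms.pal.svm import SVRanking

    ranker = SVRanking(c=100)
    ranker.fit(data=df_train, key='ID', label='LABEL', qid='QID')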

Miscellaneous

abc_analysis.abc_analysis(data[, key, ...])

This algorithm is used to classify objects (such as customers, employees, or products) based on a particular measure (such as revenue or profit).

wst.weighted_score_table(data, maps, ...[, ...])

Performs a weighted score table computation, weighting scores by the importance of each criterion.

tsne.TSNE([n_iter, learning_rate, ...])

Class for T-distributed Stochastic Neighbour Embedding.
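For instance, a hedged sketch of ABC analysis, assuming df_revenue is a hana_ml DataFrame with an ITEM key column and a revenue measure column:

    from hana_ml.algorithms.pal.abc_analysis import abc_analysis

    # Split items into classes A/B/C covering 70%/20%/10% of the measure.
    res = abc_analysis(data=df_revenue, key='ITEM',
                       percent_A=0.7, percent_B=0.2, percent_C=0.1)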

Metrics

metrics.accuracy_score(data, label_true, ...)

Compute mean accuracy score for classification results.

metrics.auc(data[, positive_label, ...])

Computes area under curve (AUC) to evaluate the performance of binary-class classification algorithms.

metrics.confusion_matrix(data, key[, ...])

Computes confusion matrix to evaluate the accuracy of a classification.

metrics.multiclass_auc(data_original, ...)

Computes area under curve (AUC) to evaluate the performance of multi-class classification algorithms.

metrics.r2_score(data, label_true, label_pred)

Computes coefficient of determination for regression results.
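For example, a minimal sketch, assuming df_pred is a hana_ml DataFrame holding the true and predicted labels of a classifier:

    from hana_ml.algorithms.pal import metrics

    # Returns the mean accuracy as a float.
    acc = metrics.accuracy_score(data=df_pred,
                                 label_true='ACTUAL',
                                 label_pred='PREDICTED')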

Model and Pipeline

model_selection.ParamSearchCV(estimator, ...)

Exhaustive or random search over specified parameter values for an estimator with cross-validation (CV).

model_selection.GridSearchCV(estimator, ...)

Exhaustive search over specified parameter values for an estimator with cross-validation (CV).

model_selection.RandomSearchCV(estimator, ...)

Random search over specified parameter values for an estimator with cross-validation (CV).

pipeline.Pipeline(steps)

Pipeline construction to run transformers and estimators sequentially.
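A hedged pipeline sketch chaining a transformer and an estimator (the step names, hyperparameters, and the df_train columns ID and CLASS are assumptions):

    from hana_ml.algorithms.pal.pipeline import Pipeline
    from hana_ml.algorithms.pal.decomposition import PCA
    from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier

    pipe = Pipeline([
        ('pca', PCA(scaling=True, scores=True)),
        ('hgbt', HybridGradientBoostingClassifier(n_estimators=10)),
    ])
    pipe.fit(data=df_train, key='ID', label='CLASS')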

Text Processing

crf.CRF([lamb, epsilon, max_iter, lbfgs_m, ...])

Conditional random field (CRF) for labeling and segmenting sequence data (e.g., text).

decomposition.LatentDirichletAllocation(...)

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
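A hedged topic-modeling sketch with LDA, assuming df_docs is a hana_ml DataFrame with a document-ID column DOC_ID and a text column TEXT:

    from hana_ml.algorithms.pal.decomposition import LatentDirichletAllocation

    # Six latent topics; the value is illustrative only.
    lda = LatentDirichletAllocation(n_components=6, seed=1)
    lda.fit(data=df_docs, key='DOC_ID', document='TEXT')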

For other text processing methods, such as text mining, please see the text mining module for more details.