hana_ml.algorithms.pal package

The Algorithms PAL Package consists of the following sections:

hana_ml.algorithms.pal.clustering

This module contains PAL wrapper and helper functions for clustering algorithms. The following classes are available:

class hana_ml.algorithms.pal.clustering.DBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

minpts : int, optional

The minimum number of points required to form a cluster.

eps : float, optional

The scan radius.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. The default is heuristically determined.

metric : str, optional

Ways to compute the distance between two points.

  • ‘manhattan’
  • ‘euclidean’
  • ‘minkowski’
  • ‘chebyshev’
  • ‘standardized_euclidean’
  • ‘cosine’

Defaults to euclidean.

minkowski_power : int, optional

When Minkowski distance is used for the metric, this parameter controls the value of power. Only valid when metric is ‘minkowski’. Defaults to 3.

categorical_variable : list of str, optional

Column names in the data table used as categorical variables.

category_weights : float, optional

Represents the weight of category attributes. Defaults to 0.707.

algorithm : str, optional

Ways to search for neighbours.

  • ‘brute-force’
  • ‘kd-tree’

Defaults to kd-tree.

save_model : bool, optional

If true, the generated model will be saved. save_model must be True to call predict(). Defaults to True.

Examples

Input dataframe for clustering:

>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A

Create a DBSCAN instance:

>>> dbscan = DBSCAN(conn_context=cc, thread_ratio=0.2, metric='manhattan')

Perform fit on the given data:

>>> dbscan.fit(df, 'ID')

Expected output:

>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
Attributes:
labels_ : DataFrame

Label assigned to each sample.

model_ : DataFrame

Model content. Set to None if save_model is False.

Methods

fit(data, key[, features]) Fit the DBSCAN model when given the training dataset.
fit_predict(data, key[, features]) Fit with the dataset and return the labels.
predict(data, key[, features]) Assign clusters to data based on a fitted model.
fit(data, key, features=None)

Fit the DBSCAN model when given the training dataset.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(data, key, features=None)

Fit with the dataset and return the labels.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the features columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Fit result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)
predict(data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters:
data : DataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Cluster assignment results, with 3 columns:
  • Data point ID, with name and type taken from the input ID column.
  • CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.
  • DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
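As an illustrative sketch (not part of the original example), cluster assignment for a hypothetical dataframe df_new, whose column structure matches the training dataframe df above, could be obtained from the fitted dbscan instance (save_model must have been left at its default of True):

>>> assignments = dbscan.predict(df_new, key='ID')
>>> assignments.collect()  # columns: ID, CLUSTER_ID, DISTANCE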
class hana_ml.algorithms.pal.clustering.KMeans(conn_context, n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False)

Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin

K-Means model that handles clustering problems.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

n_clusters : int, optional

Number of clusters. If this parameter is not specified, you must specify the minimum and maximum range parameters instead.

n_clusters_min : int, optional

Cluster range minimum.

n_clusters_max : int, optional

Cluster range maximum.

init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.
  • ‘replace’: Random with replacement.
  • ‘no_replace’: Random without replacement.
  • ‘patent’: Patent of selecting the init center (US 6,882,998 B1).

Defaults to patent.

max_iter : int, optional

Max iterations. Defaults to 100.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

distance_level : str, optional

Ways to compute the distance between the item and the cluster center.

  • ‘manhattan’
  • ‘euclidean’
  • ‘minkowski’
  • ‘chebyshev’
  • ‘cosine’

Defaults to euclidean. ‘cosine’ is only valid when accelerated is False.

minkowski_power : float, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski. Defaults to 3.0.

category_weights : float, optional

Represents the weight of category attributes. Defaults to 0.707.

normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No normalization will be applied.
  • ‘l1_norm’: Yes, for each point X (x1,x2,…,xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+...+|xn|.
  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to no.

categorical_variable : list of str, optional

Column names in the data table used as categorical variables.

tol : float, optional

Convergence threshold for exiting iterations. Only valid when accelerated is False. Defaults to 1.0e-6.

memory_mode : {‘auto’, ‘optimize-speed’, ‘optimize-space’}, optional

Indicates the memory mode that the algorithm uses.

  • ‘auto’: Chosen by algorithm.
  • ‘optimize-speed’: Prioritizes speed.
  • ‘optimize-space’: Prioritizes memory.

Only valid when accelerated is True. Defaults to auto.

accelerated : bool, optional

Indicates whether to use technology like cache to accelerate the calculation process. If True, the calculation process will be accelerated. If False, the calculation process will not be accelerated. Defaults to False.

Examples

Input dataframe for clustering:

>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Create KMeans instance:

>>> km = clustering.KMeans(conn_context=cc, n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='Euclidean',
...                        category_weights=0.5)

Perform fit_predict:

>>> labels = km.fit_predict(df, 'ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679
Input dataframe for accelerated k-means:

>>> df = cc.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1

Create an accelerated KMeans instance:

>>> akm = clustering.KMeans(conn_context=cc, init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)

Perform fit_predict:

>>> labels = akm.fit_predict(df, 'ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
Attributes:
labels_ : DataFrame

Label assigned to each sample.

cluster_centers_ : DataFrame

Coordinates of cluster centers.

model_ : DataFrame

Model content.

statistics_ : DataFrame

Statistic value.

Methods

fit(data, key[, features]) Fit the model when given the training dataset.
fit_predict(data, key[, features]) Fit with the dataset and return the labels.
predict(data, key[, features]) Assign clusters to data based on a fitted model.
fit(data, key, features=None)

Fit the model when given the training dataset.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(data, key, features=None)

Fit with the dataset and return the labels.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Fit result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
  • SLIGHT_SILHOUETTE, type DOUBLE, estimated value (slight silhouette).
predict(data, key, features=None)

Assign clusters to data based on a fitted model.

The output structure of this method does not match that of fit_predict().

Parameters:
data : DataFrame

Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Cluster assignment results, with 3 columns:
  • Data point ID, with name and type taken from the input ID column.
  • CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.
  • DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
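As an illustrative sketch (not part of the original example), cluster assignment for a hypothetical dataframe df_new, with the same column structure as the training data, could be obtained from the fitted km instance above:

>>> assignments = km.predict(df_new, key='ID')
>>> assignments.collect()  # columns: ID, CLUSTER_ID, DISTANCE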
class hana_ml.algorithms.pal.clustering.KMedians(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.

Parameters:
conn_context : ConnectionContext

Database connection object.

n_clusters : int

Number of groups.

init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.
  • ‘replace’: Random with replacement.
  • ‘no_replace’: Random without replacement.
  • ‘patent’: Patent of selecting the init center (US 6,882,998 B1).

Defaults to patent.

max_iter : int, optional

Max iterations. Defaults to 100.

tol : float, optional

Convergence threshold for exiting iterations. Defaults to 1.0e-6.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

distance_level : str, optional

Ways to compute the distance between the item and the cluster center.

  • ‘manhattan’
  • ‘euclidean’
  • ‘minkowski’
  • ‘chebyshev’
  • ‘cosine’

Defaults to euclidean.

minkowski_power : float, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski. Defaults to 3.0.

category_weights : float, optional

Represents the weight of category attributes. Defaults to 0.707.

normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No, normalization will not be applied.
  • ‘l1_norm’: Yes, for each point X (x1,x2,…,xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+...+|xn|.
  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to no.

categorical_variable : list of str, optional

Column names in the data table used as categorical variables.

Examples

Input dataframe for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedians instance:

>>> kmedians = KMedians(conn_context=cc, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedians.fit(df1, 'ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1

Performing fit_predict() on given dataframe:

>>> kmedians.fit_predict(df1, 'ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107
Attributes:
cluster_centers_ : DataFrame

Coordinates of cluster centers.

labels_ : DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

fit(data, key[, features]) Perform clustering on input dataset.
fit_predict(data, key[, features]) Perform clustering algorithm and return labels.
fit(data, key, features=None)

Perform clustering on input dataset.

Parameters:
data : DataFrame

DataFrame containing the input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(data, key, features=None)

Perform clustering algorithm and return labels.

Parameters:
data : DataFrame

DataFrame containing input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Fit result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
class hana_ml.algorithms.pal.clustering.KMedoids(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)

Bases: hana_ml.algorithms.pal.clustering._KClusteringBase

K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids to calculate cluster centers. Compared to K-Means, K-Medoids is more robust to noise and outliers.

Parameters:
conn_context : ConnectionContext

Database connection object.

n_clusters : int

Number of groups.

init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional

Controls how the initial centers are selected:

  • ‘first_k’: First k observations.
  • ‘replace’: Random with replacement.
  • ‘no_replace’: Random without replacement.
  • ‘patent’: Patent of selecting the init center (US 6,882,998 B1).

Defaults to patent.

max_iter : int, optional

Max iterations. Defaults to 100.

tol : float, optional

Convergence threshold for exiting iterations. Defaults to 1.0e-6.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

distance_level : str, optional

Ways to compute the distance between the item and the cluster center.

  • ‘manhattan’
  • ‘euclidean’
  • ‘minkowski’
  • ‘chebyshev’
  • ‘cosine’

Defaults to euclidean.

minkowski_power : float, optional

When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski. Defaults to 3.0.

category_weights : float, optional

Represents the weight of category attributes. Defaults to 0.707.

normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional

Normalization type.

  • ‘no’: No, normalization will not be applied.
  • ‘l1_norm’: Yes, for each point X (x1,x2,…,xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+...+|xn|.
  • ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).

Defaults to no.

categorical_variable : list of str, optional

Column names in the data table used as categorical variables.

Examples

Input dataframe for clustering:

>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6

Creating KMedoids instance:

>>> kmedoids = KMedoids(conn_context=cc, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)

Performing fit() on given dataframe:

>>> kmedoids.fit(df1, 'ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5

Performing fit_predict() on given dataframe:

>>> kmedoids.fit_predict(df1, 'ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714
Attributes:
cluster_centers_ : DataFrame

Coordinates of cluster centers.

labels_ : DataFrame

Cluster assignment and distance to cluster center for each point.

Methods

fit(data, key[, features]) Perform clustering on input dataset.
fit_predict(data, key[, features]) Perform clustering algorithm and return labels.
fit(data, key, features=None)

Perform clustering on input dataset.

Parameters:
data : DataFrame

DataFrame containing the input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

fit_predict(data, key, features=None)

Perform clustering algorithm and return labels.

Parameters:
data : DataFrame

DataFrame containing input data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Fit result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
  • DISTANCE, type DOUBLE, the distance between the given point and the cluster center.

hana_ml.algorithms.pal.decomposition

This module contains PAL wrappers for decomposition algorithms.

The following classes are available:

class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(conn_context, n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

n_components : int

Expected number of topics in the corpus.

doc_topic_prior : float, optional

Specifies the prior weight related to document-topic distribution. Defaults to 50/n_components.

topic_word_prior : float, optional

Specifies the prior weight related to topic-word distribution. Defaults to 0.1.

burn_in : int, optional

Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded. Defaults to 0.

iteration : int, optional

Number of Gibbs iterations. Defaults to 2000.

thin : int, optional

Number of omitted in-between Gibbs iterations. Value must be greater than 0. Defaults to 1.

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.
  • Not 0: Uses the provided value.

Defaults to 0.

max_top_words : int, optional

Specifies the maximum number of words to be output for each topic. Defaults to 0.

threshold_top_words : float, optional

The algorithm outputs top words for each topic if the probability is larger than this threshold. It cannot be used together with parameter max_top_words.

gibbs_init : str, optional

Specifies initialization method for Gibbs sampling:

  • ‘uniform’: Assign each word in each document a topic by uniform distribution.
  • ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to ‘uniform’.

delimiters : list of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long. Defaults to [‘ ‘].

output_word_assignment : bool, optional

Controls whether to output the word_topic_assignment_ or not. If True, output the word_topic_assignment_. Defaults to False.

Notes

  • Parameters max_top_words and threshold_top_words cannot be used together.
  • Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().

Examples

Input dataframe for training:

>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...

Creating LDA instance:

>>> lda = LatentDirichletAllocation(cc, n_components=6, burn_in=50, thin=10,
...                                 iteration=100, seed=1,
...                                 max_top_words=5, doc_topic_prior=0.1,
...                                 output_word_assignment=True,
...                                 delimiters=[' ', '\r', '\n'])

Performing fit() on given dataframe:

>>> lda.fit(df1, 'DOCUMENT_ID', 'TEXT')
>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                       WORDS
0         0     spoon strollers tires graphiccard valve
1         1       toy strollers carseat graphiccard cpu
2         2              sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5       strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762

Dataframe to transform:

>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu

Performing transform on the given dataframe:

>>> res = lda.transform(df2, 'DOCUMENT_ID', 'TEXT', burn_in=2000, thin=100,
...                     iteration=1000, seed=1,
...                     output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
Attributes:
doc_topic_dist_ : DataFrame
DOCUMENT_TOPIC_DISTRIBUTION table, structured as follows:
  • Document ID column, with same name and type as data’s document ID column from fit().
  • TOPIC_ID, type INTEGER, topic ID.
  • PROBABILITY, type DOUBLE, probability of topic given document.
word_topic_assignment_ : DataFrame
WORD_TOPIC_ASSIGNMENT table, structured as follows:
  • Document ID column, with same name and type as data’s document ID column from fit().
  • WORD_ID, type INTEGER, word ID.
  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is set to False.

topic_top_words_ : DataFrame
TOPIC_TOP_WORDS table, structured as follows:
  • TOPIC_ID, type INTEGER, topic ID.
  • WORDS, type NVARCHAR(5000), topic top words separated by spaces.

Set to None if neither max_top_words nor threshold_top_words is provided.

topic_word_dist_ : DataFrame
TOPIC_WORD_DISTRIBUTION table, structured as follows:
  • TOPIC_ID, type INTEGER, topic ID.
  • WORD_ID, type INTEGER, word ID.
  • PROBABILITY, type DOUBLE, probability of word given topic.
dictionary_ : DataFrame
DICTIONARY table, structured as follows:
  • WORD_ID, type INTEGER, word ID.
  • WORD, type NVARCHAR(5000), word text.
statistic_ : DataFrame
STATISTICS table, structured as follows:
  • STAT_NAME, type NVARCHAR(256), statistic name.
  • STAT_VALUE, type NVARCHAR(1000), statistic value.

Methods

fit(data, key[, document]) Fit LDA model based on training data.
fit_transform(data, key[, document]) Fit LDA model based on training data and return the topic assignment for the training documents.
transform(data, key[, document, burn_in, …]) Transform the topic assignment for new documents based on the previous LDA estimation results.
fit(data, key, document=None)

Fit LDA model based on training data.

Parameters:
data : DataFrame

Training data.

key : str

Name of the document ID column.

document : str, optional

Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

fit_transform(data, key, document=None)

Fit LDA model based on training data and return the topic assignment for the training documents.

Parameters:
data : DataFrame

Training data.

key : str

Name of the document ID column.

document : str, optional

Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

Returns:
doc_topic_df : DataFrame
DOCUMENT_TOPIC_DISTRIBUTION table, structured as follows:
  • Document ID column, with same name and type as data’s document ID column.
  • TOPIC_ID, type INTEGER, topic ID.
  • PROBABILITY, type DOUBLE, probability of topic given document.
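As an illustrative sketch (not part of the original example), fitting the model and obtaining the document-topic distribution in a single call, using the lda instance and training dataframe df1 from the Examples above; the returned DataFrame has the same structure as doc_topic_dist_:

>>> doc_topic = lda.fit_transform(df1, 'DOCUMENT_ID', 'TEXT')
>>> doc_topic.collect()  # columns: DOCUMENT_ID, TOPIC_ID, PROBABILITY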
transform(data, key, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)

Transform the topic assignment for new documents based on the previous LDA estimation results.

Parameters:
data : DataFrame

Independent variable values used for transform.

key : str

Name of the document ID column.

document : str, optional

Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.

burn_in : int, optional

Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded. Defaults to 0 if not set in __init__().

iteration : int, optional

Number of Gibbs iterations. Defaults to 2000 if not set in __init__().

thin : int, optional

Number of omitted in-between Gibbs iterations. Defaults to 1 if not set in __init__().

seed : int, optional

Indicates the seed used to initialize the random number generator:

  • 0: Uses the system time.
  • Not 0: Uses the provided value.

Defaults to 0 if not set in __init__().

gibbs_init : str, optional

Specifies initialization method for Gibbs sampling:

  • ‘uniform’: Assign each word in each document a topic by uniform distribution.
  • ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.

Defaults to ‘uniform’ if not set in __init__().

delimiters : list of str, optional

Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long. Defaults to [‘ ‘] if not set in __init__().

output_word_assignment : bool, optional

Controls whether to output the word_topic_df or not. If True, output the word_topic_df. Defaults to False.

Returns:
doc_topic_df : DataFrame
DOCUMENT_TOPIC_DISTRIBUTION table, structured as follows:
  • Document ID column, with same name and type as data’s document ID column.
  • TOPIC_ID, type INTEGER, topic ID.
  • PROBABILITY, type DOUBLE, probability of topic given document.
word_topic_df : DataFrame
WORD_TOPIC_ASSIGNMENT table, structured as follows:
  • Document ID column, with same name and type as data’s document ID column.
  • WORD_ID, type INTEGER, word ID.
  • TOPIC_ID, type INTEGER, topic ID.

Set to None if output_word_assignment is False.

stat_df : DataFrame
STATISTICS table, structured as follows:
  • STAT_NAME, type NVARCHAR(256), statistic name.
  • STAT_VALUE, type NVARCHAR(1000), statistic value.
class hana_ml.algorithms.pal.decomposition.PCA(conn_context, scaling=None, thread_ratio=None, scores=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Principal component analysis procedure to reduce the dimensionality of multivariate data using Singular Value Decomposition.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. The default is heuristically determined.

scaling : bool, optional

If true, scale variables to have unit variance before the analysis takes place. Defaults to False.

scores : bool, optional

If true, output the scores on each principal component when fitting. Defaults to False.

Notes

Variables cannot be scaled if any variable has a constant value across all data items.

Examples

Input dataframe for training:

>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0

Creating PCA instance:

>>> pca = PCA(cc, scaling=True, thread_ratio=0.5, scores=True)

Performing fit() on given dataframe:

>>> pca.fit(df1, key='ID')
>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398

Input dataframe for transforming:

>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0

Performing transform() on given dataframe:

>>> result = pca.transform(df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035
Attributes:
loadings_ : DataFrame

The weights by which each standardized original variable should be multiplied when computing component scores.

loadings_stat_ : DataFrame

Loadings statistics on each component.

scores_ : DataFrame

The transformed variable values corresponding to each data point. Set to None if scores is False.

scaling_stat_ : DataFrame

Mean and scale values of each variable.

Methods

fit(data, key[, features]) Principal component analysis function.
fit_transform(data, key[, features]) Fit with the dataset and return the scores.
transform(data, key[, features, n_components]) Principal component analysis projection function using a trained model.
fit(data, key, features=None)

Principal component analysis function.

Parameters:
data : DataFrame

Data to be analyzed.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

fit_transform(data, key, features=None)

Fit with the dataset and return the scores.

Parameters:
data : DataFrame

Data to be analyzed.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns:
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with same name and type as data’s ID column.
  • Score columns, type DOUBLE, representing the component score values of each data point.
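As an illustrative sketch (not part of the original example), fitting and retrieving the component scores in one call, using the pca instance and dataframe df1 from the Examples above:

>>> scores = pca.fit_transform(df1, key='ID')
>>> scores.collect()  # ID column plus one DOUBLE score column per component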
transform(data, key, features=None, n_components=None)

Principal component analysis projection function using a trained model.

Parameters:
data : DataFrame

Data to be analyzed.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

n_components : int, optional

Number of components to be retained. The value range is from 1 to the number of features. Defaults to the number of features.

Returns:
DataFrame

Transformed variable values corresponding to each data point, structured as follows:

  • ID column, with same name and type as data’s ID column.
  • Score columns, type DOUBLE, representing the component score values of each data point.

hana_ml.algorithms.pal.linear_model

This module contains PAL wrapper and helper functions for linear model algorithms. The following classes are available:

class hana_ml.algorithms.pal.linear_model.LinearRegression(conn_context, solver=None, var_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A linear regression model, based on PAL_LINEAR_REGRESSION.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

solver : {‘QR’, ‘SVD’, ‘CD’, ‘Cholesky’, ‘ADMM’}, optional

Algorithm to use for solving the least squares problem. Case-insensitive.

  • ‘QR’: QR decomposition.
  • ‘SVD’: singular value decomposition.
  • ‘CD’: cyclical coordinate descent method.
  • ‘Cholesky’: Cholesky decomposition.
  • ‘ADMM’: alternating direction method of multipliers.

‘CD’ and ‘ADMM’ are supported only when var_select is ‘all’. Defaults to QR decomposition.

var_select : {‘all’, ‘forward’, ‘backward’}, optional

Method to perform variable selection.

  • ‘all’: all variables are included.
  • ‘forward’: forward selection.
  • ‘backward’: backward selection.

‘forward’ and ‘backward’ selection are supported only when solver is ‘QR’, ‘SVD’ or ‘Cholesky’. Defaults to ‘all’.

intercept : bool, optional

If true, include the intercept in the model. Defaults to True.

alpha_to_enter : float, optional

P-value for forward selection. Valid only when var_select is ‘forward’. Defaults to 0.05.

alpha_to_remove : float, optional

P-value for backward selection. Valid only when var_select is ‘backward’. Defaults to 0.1.

enet_lambda : float, optional

Penalized weight. Value should be greater than or equal to 0. Valid only when solver is ‘CD’ or ‘ADMM’.

enet_alpha : float, optional

Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is ‘CD’ or ‘ADMM’. Defaults to 1.0.

max_iter : int, optional

Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is ‘CD’ or ‘ADMM’. Defaults to 1e5.

tol : float, optional

Convergence threshold for coordinate descent. Valid only when solver is ‘CD’. Defaults to 1.0e-7.

pho : float, optional

Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is ‘ADMM’. Defaults to 1.8.

stat_inf : bool, optional

If true, output t-value and Pr(>|t|) of coefficients. Defaults to False.

adjusted_r2 : bool, optional

If true, include the adjusted R^2 value in statistics. Defaults to False.

dw_test : bool, optional

If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to False.

reset_test : int, optional

Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to 1.

bp_test : bool, optional

If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to False.

ks_test : bool, optional

If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to False.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Valid only when solver is ‘QR’, ‘CD’, ‘Cholesky’ or ‘ADMM’. Defaults to 0.0.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

pmml_export : {‘no’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.
  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

Examples

Training data:

>>> df.collect()
  ID       Y    X1 X2  X3
0  0  -6.879  0.00  A   1
1  1  -3.449  0.50  A   1
2  2   6.635  0.54  B   1
3  3  11.844  1.04  B   1
4  4   2.786  1.50  A   1
5  5   2.389  0.04  B   2
6  6  -0.011  2.00  A   2
7  7   8.839  2.04  B   2
8  8   4.689  1.54  B   1
9  9  -5.507  1.00  A   2

Training the model:

>>> lr = LinearRegression(cc,
...                       thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(df, key='ID', label='Y')

Prediction:

>>> df2.collect()
   ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(df2, key='ID').collect()
   ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685
Attributes:
coefficients_ : DataFrame

Fitted regression coefficients.

pmml_ : DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_ : DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_ : DataFrame

Regression-related statistics, such as mean squared error.

Methods

fit(data[, key, features, label]) Fit regression model based on training data.
predict(data, key[, features]) Predict dependent variable values based on fitted model.
score(data, key[, features, label]) Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns:
DataFrame
Predicted values, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • VALUE, type DOUBLE, representing predicted values.

Notes

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.

score(data, key, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. If label is not provided, it defaults to the last column.

Returns:
accuracy : float

Returns the coefficient of determination R^2 of the prediction.

Notes

score() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
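As an illustrative sketch (not part of the original example), evaluating the fitted lr model on the training dataframe df from the Examples above returns the coefficient of determination as a Python float:

>>> r2 = lr.score(df, key='ID', label='Y')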

class hana_ml.algorithms.pal.linear_model.LogisticRegression(conn_context, multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, alpha=None, lamb=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, lbfgs_m=None, class_map0=None, class_map1=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Logistic regression model that handles binary-class and multi-class classification problems.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

multi_class : bool, optional

If true, perform multi-class classification. Otherwise, there must be only two classes. Defaults to False.

max_iter : int, optional

Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.

  • multi-class: Defaults to 100.
  • binary-class: Defaults to 100000 when solver is cyclical, 1000 when solver is proximal, otherwise 100.
pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • multi-class:
    • ‘no’ or not provided: No PMML model.
    • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
  • binary-class:
    • ‘no’ or not provided: No PMML model.
    • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
    • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Defaults to ‘no’.

categorical_variable : list of str, optional

Column names in the data table used as categorical variables.

standardize : bool, optional

If true, standardize the data to have zero mean and unit variance. Defaults to True.

stat_inf : bool, optional

If true, proceed with statistical inference. Defaults to False.

solver : {‘newton’, ‘cyclical’, ‘lbfgs’, ‘stochastic’, ‘proximal’}, optional

Optimization algorithm.

  • ‘newton’: Newton iteration method.
  • ‘cyclical’: Cyclical coordinate descent method to fit elastic net regularized logistic regression.
  • ‘lbfgs’: LBFGS method (recommended when having many independent variables).
  • ‘stochastic’: Stochastic gradient descent method (recommended when dealing with very large datasets).
  • ‘proximal’: Proximal gradient descent method to fit elastic net regularized logistic regression.

Only valid when multi_class is False. Defaults to newton.

alpha : float, optional

Elastic net mixing parameter. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal. Defaults to 1.0.

lamb : float, optional

Penalized weight. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal. Defaults to 0.0.

tol : float, optional

Convergence threshold for exiting iterations. Only valid when multi_class is False. Defaults to 1.0e-7 when solver is cyclical, 1.0e-6 otherwise.

epsilon : float, optional

Determines the accuracy with which the solution is to be found. Only valid when multi_class is False and the solver is newton or lbfgs. Defaults to 1.0e-6 when solver is newton, 1.0e-5 when solver is lbfgs.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. thread_ratio cannot be set separately for fit(), predict() and score(). Only valid when multi_class is False. Defaults to 1.0 for fit(), 0.0 for predict() and score().

max_pass_number : int, optional

The maximum number of passes over the data. Only valid when multi_class is False and solver is stochastic. Defaults to 1.

sgd_batch_number : int, optional

The batch number of Stochastic gradient descent. Only valid when multi_class is False and solver is stochastic. Defaults to 1.

lbfgs_m : int, optional

Number of previous updates to keep. Only applicable when multi_class is False and solver is lbfgs. Defaults to 6.

class_map0 : str, optional

Categorical label to map to 0. Only valid when multi_class is False. class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

class_map1 : str, optional

Categorical label to map to 1. Only valid when multi_class is False. class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.

Examples

Training data:

>>> df.collect()
   V1     V2  V3  CATEGORY
0   B  2.620   0         1
1   B  2.875   0         1
2   A  2.320   1         1
3   A  3.215   2         0
4   B  3.440   3         0
5   B  3.460   0         0
6   A  3.570   1         0
7   B  3.190   2         0
8   A  3.150   3         0
9   B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1
18  B  1.615   2         1
19  A  1.835   0         1
20  B  2.465   3         0
21  A  3.520   1         0
22  A  3.435   0         0
23  B  3.840   2         0
24  B  3.845   3         0
25  A  1.935   1         1
26  B  2.140   0         1
27  B  1.513   1         1
28  A  3.170   3         1
29  B  2.770   0         1
30  B  3.570   0         1
31  A  2.780   3         1

Create LogisticRegression instance and call fit:

>>> lr = linear_model.LogisticRegression(cc, solver='newton',
...                                      thread_ratio=0.1, max_iter=1000,
...                                      categorical_variable=['V3'],
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(df, features=['V1', 'V2', 'V3'], label='CATEGORY')
>>> lr.coef_.collect()
        VARIABLE_NAME  COEFFICIENT
0   __PAL_INTERCEPT__    15.579882
1  V1__PAL_DELIMIT__B     0.000000
2  V1__PAL_DELIMIT__A     1.464903
3                  V2    -4.819740
4  V3__PAL_DELIMIT__0     0.000000
5  V3__PAL_DELIMIT__1    -2.794139
6  V3__PAL_DELIMIT__2    -4.807858
7  V3__PAL_DELIMIT__3    -2.780918
>>> pred_df = cc.table('DATA_TBL_PREDICT')
>>> pred_df.collect()
    ID V1     V2  V3
0    0  B  2.620   0
1    1  B  2.875   0
2    2  A  2.320   1
3    3  A  3.215   2
4    4  B  3.440   3
5    5  B  3.460   0
6    6  A  3.570   1
7    7  B  3.190   2
8    8  A  3.150   3
9    9  B  3.440   0
10  10  B  3.440   1
11  11  A  4.070   3
12  12  A  3.730   1
13  13  B  3.780   2
14  14  B  5.250   2
15  15  A  5.424   3
16  16  A  5.345   0
17  17  B  2.200   1

Call predict:

>>> result = lr.predict(pred_df, 'ID', ['V1', 'V2', 'V3'])
>>> result.collect()
    ID CLASS   PROBABILITY
0    0     1  9.503656e-01
1    1     1  8.485314e-01
2    2     1  9.555893e-01
3    3     0  3.702131e-02
4    4     0  2.229288e-02
5    5     0  2.504115e-01
6    6     0  4.946187e-02
7    7     0  9.922804e-03
8    8     0  2.853014e-01
9    9     0  2.689367e-01
10  10     0  2.200654e-02
11  11     0  4.714084e-03
12  12     0  2.349977e-02
13  13     0  5.830852e-04
14  14     0  4.886534e-07
15  15     0  6.938601e-06
16  16     0  1.637959e-04
17  17     1  8.986501e-01
Attributes:
coef_ : DataFrame

Values of the coefficients.

result_ : DataFrame

Model content.

pmml_ : DataFrame

PMML model. Set to None if no PMML model was requested.

Methods

fit(data[, key, features, label]) Fit the LR model when given training dataset.
predict(data, key[, features, verbose]) Predict with the dataset using the trained model.
score(data, key[, features, label]) Return the mean accuracy on the given test data and labels.
fit(data, key=None, features=None, label=None)

Fit the LR model when given training dataset.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(data, key, features=None, verbose=False)

Predict with the dataset using the trained model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If true, output scoring probabilities for each class. Only applicable for the multi-class case. Defaults to False.

Returns:
DataFrame
Predicted result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • CLASS, type NVARCHAR, predicted class name.
  • PROBABILITY, type DOUBLE
    • multi-class: probability of being predicted as the predicted class.
    • binary-class: probability of being predicted as the positive class.

Notes

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the result_ table otherwise.
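For the multi-class case, setting verbose=True yields one PROBABILITY row per class for every data point. An illustrative call (hypothetical names; it assumes an instance fitted with multi_class=True, called lr_mc here, and a prediction DataFrame mc_pred_df keyed on 'ID'):

>>> probs = lr_mc.predict(mc_pred_df, key='ID', verbose=True)
>>> probs.collect()    # contents depend on the data and the fitted model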

score(data, key, features=None, label=None)

Return the mean accuracy on the given test data and labels.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns:
accuracy : float

Scalar accuracy value after comparing the predicted label and original label.
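An illustrative call of score(), reusing the lr instance from the Examples section and assuming a labeled test DataFrame df_test (hypothetical name) with an 'ID' column and CATEGORY as its last column; the returned accuracy depends on the data:

>>> acc = lr.score(df_test, key='ID')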

hana_ml.algorithms.pal.metrics

This module contains PAL wrappers for metrics to assess the quality of model outputs.

The following functions are available:

hana_ml.algorithms.pal.metrics.accuracy_score(conn_context, data, label_true, label_pred)

Compute mean accuracy score for classification results. That is, the proportion of the correctly predicted results among the total number of cases examined.

Parameters:
conn_context : ConnectionContext

HANA connection.

data : DataFrame

DataFrame of true and predicted labels.

label_true : str

Name of the column containing ground truth labels.

label_pred : str

Name of the column containing predicted labels, as returned by a classifier.

Returns:
accuracy : float

Accuracy classification score. A lower accuracy indicates that the classifier was able to predict fewer of the input labels correctly.

Examples

Actual and predicted labels for a hypothetical classification:

>>> df.collect()
   ACTUAL  PREDICTED
0    1        0
1    0        0
2    0        0
3    1        1
4    1        1

Accuracy score for these predictions:

>>> accuracy_score(cc, df, label_true='ACTUAL', label_pred='PREDICTED')
0.8

Compare that to null accuracy (accuracy that could be achieved by always predicting the most frequent class):

>>> df_dummy.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       1
2    0       1
3    1       1
4    1       1
>>> accuracy_score(cc, df_dummy, label_true='ACTUAL', label_pred='PREDICTED')
0.6

A perfect predictor:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    1       1
1    0       0
2    0       0
3    1       1
4    1       1
>>> accuracy_score(cc, df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0
hana_ml.algorithms.pal.metrics.auc(conn_context, data, positive_label=None)

Compute area under curve (AUC) to evaluate the performance of binary-class classification algorithms.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

Input data, structured as follows:

  • ID column.
  • True class of the data point.
  • Classifier-computed probability that the data point belongs to the positive class.

positive_label : str, optional

If original label is not 0 or 1, specifies the label value which will be mapped to 1.

Returns:
auc : float

The area under the receiver operating characteristic curve.

roc : DataFrame

False positive rate and true positive rate, structured as follows:

  • ID column, type INTEGER.
  • FPR, type DOUBLE, representing false positive rate.
  • TPR, type DOUBLE, representing true positive rate.

Examples

Input data:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         0     0.07
1   2         0     0.01
2   3         0     0.85
3   4         0     0.30
4   5         0     0.50
5   6         1     0.50
6   7         1     0.20
7   8         1     0.80
8   9         1     0.20
9  10         1     0.95

Compute Area Under Curve:

>>> auc, roc = auc(cc, df)

Ideal output:

>>> print(auc)
 0.66
>>> roc.collect()
   ID  FPR  TPR
0   0  1.0  1.0
1   1  0.8  1.0
2   2  0.6  1.0
3   3  0.6  0.6
4   4  0.4  0.6
5   5  0.2  0.4
6   6  0.2  0.2
7   7  0.0  0.2
8   8  0.0  0.0
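As a consistency check, sorting the (FPR, TPR) points above by FPR and integrating TPR with the trapezoidal rule reproduces the reported value: the vertical segments contribute zero area, and the remaining segments give 0.2*0.2 + 0.2*(0.4+0.6)/2 + 0.2*0.6 + 0.2*1.0 + 0.2*1.0 = 0.04 + 0.10 + 0.12 + 0.20 + 0.20 = 0.66.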
hana_ml.algorithms.pal.metrics.confusion_matrix(conn_context, data, key, label_true=None, label_pred=None, beta=None, native=True)

Compute confusion matrix to evaluate the accuracy of a classification.

Parameters:
conn_context : ConnectionContext

Database connection object.

data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

label_true : str, optional

Name of the original label column. If not given, defaults to the second column.

label_pred : str, optional

Name of the predicted label column. If not given, defaults to the third column.

beta : float, optional

Parameter used to compute the F-Beta score. Default value: 1

native : bool, optional

Indicates whether to use native SQL statements for the confusion matrix calculation. Default value: True

Returns:
confusion_matrix_df : DataFrame
Confusion matrix, structured as follows:
  • Original label, with same name and data type as it is in data.
  • Predicted label, with same name and data type as it is in data.
  • Count, type INTEGER, the number of data points with the corresponding combination of predicted and original label.
The dataframe is sorted by (original label, predicted label) in descending
order.
classification_report_df : DataFrame
Structured as follows:
  • Class, type NVARCHAR(100), class name
  • Recall, type DOUBLE, the recall of each class
  • Precision, type DOUBLE, the precision of each class
  • F_MEASURE, type DOUBLE, the F_measure of each class
  • SUPPORT, type INTEGER, the support, i.e. the number of samples in each class

Examples

Data contains the original label and predict label:

>>> df.collect()
   ID  ORIGINAL  PREDICT
0   1         1        1
1   2         1        1
2   3         1        1
3   4         1        2
4   5         1        1
5   6         2        2
6   7         2        1
7   8         2        2
8   9         2        2
9  10         2        2

Calculate the confusion matrix

>>> cm, cr = confusion_matrix(connection_context, df, 'ID', 'ORIGINAL',
...                           'PREDICT')
>>> cm.collect()
   ORIGINAL  PREDICT  COUNT
0         1        1      4
1         1        2      1
2         2        1      1
3         2        2      4
>>> cr.collect()
  CLASS  RECALL  PRECISION  F_MEASURE  SUPPORT
0     1     0.8        0.8        0.8        5
1     2     0.8        0.8        0.8        5
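For reference, the F_MEASURE column follows the usual F-beta definition. With the default beta=1, F = 2 * precision * recall / (precision + recall) = 2 * 0.8 * 0.8 / (0.8 + 0.8) = 0.8 for both classes, matching the report above.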
hana_ml.algorithms.pal.metrics.multiclass_auc(conn_context, data_original, data_predict)

Compute area under curve (AUC) to evaluate the performance of multi-class classification algorithms.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data_original : DataFrame

True class data, structured as follows:

  • Data point ID column.
  • True class of the data point.

data_predict : DataFrame

Predicted class data, structured as follows:

  • Data point ID column.
  • Possible class.
  • Classifier-computed probability that the data point belongs to that particular class.

For each data point ID, there should be one row for each possible class.

Returns:
auc : float

The area under the receiver operating characteristic curve.

roc : DataFrame

False positive rate and true positive rate, structured as follows:

  • ID column, type INTEGER.
  • FPR, type DOUBLE, representing false positive rate.
  • TPR, type DOUBLE, representing true positive rate.

Examples

Input data:

>>> df_original.collect()
   ID  ORIGINAL
0   1         1
1   2         1
2   3         1
3   4         2
4   5         2
5   6         2
6   7         3
7   8         3
8   9         3
9  10         3
>>> df_predict.collect()
    ID  PREDICT  PROB
0    1        1  0.90
1    1        2  0.05
2    1        3  0.05
3    2        1  0.80
4    2        2  0.05
5    2        3  0.15
6    3        1  0.80
7    3        2  0.10
8    3        3  0.10
9    4        1  0.10
10   4        2  0.80
11   4        3  0.10
12   5        1  0.20
13   5        2  0.70
14   5        3  0.10
15   6        1  0.05
16   6        2  0.90
17   6        3  0.05
18   7        1  0.10
19   7        2  0.10
20   7        3  0.80
21   8        1  0.00
22   8        2  0.00
23   8        3  1.00
24   9        1  0.20
25   9        2  0.10
26   9        3  0.70
27  10        1  0.20
28  10        2  0.20
29  10        3  0.60

Compute Area Under Curve:

>>> auc, roc = multiclass_auc(cc, df_original, df_predict)

Ideal output:

>>> print(auc)
1.0
>>> roc.collect()
    ID   FPR  TPR
0    0  1.00  1.0
1    1  0.90  1.0
2    2  0.65  1.0
3    3  0.25  1.0
4    4  0.20  1.0
5    5  0.00  1.0
6    6  0.00  0.9
7    7  0.00  0.7
8    8  0.00  0.3
9    9  0.00  0.1
10  10  0.00  0.0
hana_ml.algorithms.pal.metrics.r2_score(conn_context, data, label_true, label_pred)

Compute coefficient of determination for regression results.

Parameters:
conn_context : ConnectionContext

HANA connection.

data : DataFrame

DataFrame of true and predicted values.

label_true : str

Name of the column containing true values.

label_pred : str

Name of the column containing values predicted by regression.

Returns:
r2 : float

Coefficient of determination. 1.0 indicates an exact match between true and predicted values. A lower coefficient of determination indicates that the regression was able to predict less of the variance in the input. A negative value indicates that the regression performed worse than just taking the mean of the true values and using that for every prediction.

Examples

Actual and predicted values for a hypothetical regression:

>>> df.collect()
   ACTUAL  PREDICTED
0    0.10        0.2
1    0.90        1.0
2    2.10        1.9
3    3.05        3.0
4    4.00        3.5

R^2 score for these predictions:

>>> r2_score(cc, df, label_true='ACTUAL', label_pred='PREDICTED')
0.9685233682514102
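This value can be checked against the standard definition R^2 = 1 - SS_res/SS_tot. The mean of the true values is 2.03, so SS_tot = (0.10-2.03)^2 + ... + (4.00-2.03)^2 = 9.928 and SS_res = 0.01 + 0.01 + 0.04 + 0.0025 + 0.25 = 0.3125, giving 1 - 0.3125/9.928 ≈ 0.96852.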

Compare that to the score for a perfect predictor:

>>> df_perfect.collect()
   ACTUAL  PREDICTED
0    0.10       0.10
1    0.90       0.90
2    2.10       2.10
3    3.05       3.05
4    4.00       4.00
>>> r2_score(cc, df_perfect, label_true='ACTUAL', label_pred='PREDICTED')
1.0

A naive mean predictor:

>>> df_mean.collect()
   ACTUAL  PREDICTED
0    0.10       2.03
1    0.90       2.03
2    2.10       2.03
3    3.05       2.03
4    4.00       2.03
>>> r2_score(cc, df_mean, label_true='ACTUAL', label_pred='PREDICTED')
0.0

And a really awful predictor:

>>> df_awful.collect()
   ACTUAL  PREDICTED
0    0.10    12345.0
1    0.90    91923.0
2    2.10    -4444.0
3    3.05    -8888.0
4    4.00    -9999.0
>>> r2_score(cc, df_awful, label_true='ACTUAL', label_pred='PREDICTED')
-886477397.139857

hana_ml.algorithms.pal.mixture

This module includes mixture modeling algorithms.

The following classes are available:

class hana_ml.algorithms.pal.mixture.GaussianMixture(conn_context, n_components=None, seeds=None, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Representation of a Gaussian mixture model probability distribution.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

n_components : int

Specifies the number of Gaussian distributions. Either n_components or seeds needs to be provided.

seeds : list of int

Specifies the data points to be used as seeds, identified by their sequence number in the data table (starting from 0). Either n_components or seeds needs to be provided.

thread_ratio : float, optional

Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

max_iter : int, optional

Specifies the maximum number of iterations for the EM algorithm. Default value: 100.

categorical_variable : list of str, optional

Indicates which features should be treated as categorical. By default, the treatment is detected from the column type of the input data:
  • string: categorical
  • integer and float: continuous
Valid only for integer variables; omitted otherwise.

category_weight : float, optional

Represents the weight of category attributes. Defaults to 0.707.

error_tol : float, optional

Specifies the error tolerance, which is the stop condition. Defaults to 1e-5.

Examples

Input dataframe for training:

>>> df1.collect()
    ID     X1     X2  X3
0    0   0.10   0.10   1
1    1   0.11   0.10   1
2    2   0.10   0.11   1
3    3   0.11   0.11   1
4    4   0.12   0.11   1
5    5   0.11   0.12   1
6    6   0.12   0.12   1
7    7   0.12   0.13   1
8    8   0.13   0.12   2
9    9   0.13   0.13   2
10  10   0.13   0.14   2
11  11   0.14   0.13   2
12  12  10.10  10.10   1
13  13  10.11  10.10   1
14  14  10.10  10.11   1
15  15  10.11  10.11   1
16  16  10.11  10.12   2
17  17  10.12  10.11   2
18  18  10.12  10.12   2
19  19  10.12  10.13   2
20  20  10.13  10.12   2
21  21  10.13  10.13   2
22  22  10.13  10.14   2
23  23  10.14  10.13   2

Creating GMM instance:

>>> gmm = GaussianMixture(conn_context=cc, n_components=2,
...                       max_iter=500, error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'])

Performing fit() on given dataframe:

>>> gmm.fit(df1, key='ID')
>>> gmm.labels_.head(14).collect()
    ID  CLUSTER_ID  PROBABILITY
0    0           0          1.0
1    0           1          0.0
2    1           0          1.0
3    1           1          0.0
4    2           0          1.0
5    2           1          0.0
6    3           0          1.0
7    3           1          0.0
8    4           0          1.0
9    4           1          0.0
10   5           0          1.0
11   5           1          0.0
12   6           0          1.0
13   6           1          0.0
Attributes:
model_ : DataFrame

Trained model content.

labels_ : DataFrame

Cluster membership probabilities for each data point.

Methods

fit(data, key[, features]) Perform GMM clustering on input dataset.
fit_predict(data, key[, features]) Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
fit(data, key, features=None)

Perform GMM clustering on input dataset.

Parameters:
data : DataFrame

Data to be clustered.

key : str

Name of the ID column.

features : list of str, optional

List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.

fit_predict(data, key, features=None)

Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.

Parameters:
data : DataFrame

Data to be clustered.

key : str

Name of the ID column.

features : list of str, optional

List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.

Returns:
DataFrame

Cluster membership probabilities.
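As an illustration, fit_predict() combines the two steps shown in the Examples section: reusing the gmm instance and df1 from above, the following should return the same table as labels_ (contents depend on the data):

>>> memberships = gmm.fit_predict(df1, key='ID')
>>> memberships.head(4).collect()    # ID, CLUSTER_ID and PROBABILITY columns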

hana_ml.algorithms.pal.naive_bayes

This module contains wrappers for PAL naive bayes classification.

The following classes are available:

class hana_ml.algorithms.pal.naive_bayes.NaiveBayes(conn_context, alpha=None, discretization=None, model_format=None, categorical_variable=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A classification model based on Bayes’ theorem.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

alpha : float, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to 0.

discretization : {‘no’, ‘supervised’}, optional
Discretize continuous attributes. Case-insensitive.
  • ‘no’ or not provided: disable discretization.
  • ‘supervised’: use supervised discretization on all the continuous attributes.

Defaults to no.

model_format : {‘json’, ‘pmml’}, optional

Controls whether to output the model in JSON format or PMML format. Case-insensitive.

  • ‘json’ or not provided: JSON format.
  • ‘pmml’: PMML format.

Defaults to json.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

Notes

The Laplace value (alpha) is only stored by JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().

Examples

Training data:

>>> df1.collect()
  HomeOwner MaritalStatus  AnnualIncome DefaultedBorrower
0       YES        Single         125.0                NO
1        NO       Married         100.0                NO
2        NO        Single          70.0                NO
3       YES       Married         120.0                NO
4        NO      Divorced          95.0               YES
5        NO       Married          60.0                NO
6       YES      Divorced         220.0                NO
7        NO        Single          85.0               YES
8        NO       Married          75.0                NO
9        NO        Single          90.0               YES

Training the model:

>>> nb = NaiveBayes(cc, alpha=1.0, model_format='pmml')
>>> nb.fit(df1)

Prediction:

>>> df2.collect()
   ID HomeOwner MaritalStatus  AnnualIncome
0   0        NO       Married         120.0
1   1       YES       Married         180.0
2   2        NO        Single          90.0
>>> nb.predict(df2, 'ID', alpha=1.0, verbose=True).collect()
   ID CLASS  CONFIDENCE
0   0    NO   -6.572353
1   0   YES  -23.747252
2   1    NO   -7.602221
3   1   YES -169.133547
4   2    NO   -7.133599
5   2   YES   -4.648640
Attributes:
model_ : DataFrame

Trained model content.

Methods

fit(data[, key, features, label]) Fit classification model based on training data.
predict(data, key[, features, alpha, verbose]) Predict based on fitted model.
score(data, key[, features, label, alpha]) Returns the mean accuracy on the given test data and labels.
fit(data, key=None, features=None, label=None)

Fit classification model based on training data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

predict(data, key, features=None, alpha=None, verbose=None)

Predict based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

alpha : float, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

verbose : bool, optional

If true, output all classes and the corresponding confidences for each data point.

Defaults to False.

Returns:
DataFrame
Predicted result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • CLASS, type NVARCHAR, predicted class name.
  • CONFIDENCE, type DOUBLE, confidence for the prediction of the sample, which is a logarithmic value of the posterior probabilities.

Notes

A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.

score(data, key, features=None, label=None, alpha=None)

Returns the mean accuracy on the given test data and labels.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

alpha : float, optional

Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.

Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.

Returns:
float

Mean accuracy on the given test data and labels.
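An illustrative call of score() (hypothetical names; it assumes a labeled test DataFrame df3 with an 'ID' column and the true class in its last column). As noted above, a PMML-format model does not store alpha, so it may need to be passed again here:

>>> acc = nb.score(df3, key='ID', alpha=1.0)    # mean accuracy as a float; value depends on the data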

hana_ml.algorithms.pal.neighbors

This module contains PAL wrappers for the k-nearest neighbors algorithms.

The following classes are available:

class hana_ml.algorithms.pal.neighbors.KNN(conn_context, n_neighbors=None, thread_ratio=None, voting_type=None, stat_info=True, metric=None, minkowski_power=None, algorithm=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

K-Nearest Neighbor(KNN) model that handles classification problems.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

n_neighbors : int, optional

Number of nearest neighbors. Defaults to 1.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

voting_type : {‘majority’, ‘distance-weighted’}, optional

Method used to vote for the most frequent label of the K nearest neighbors. Defaults to distance-weighted.

stat_info : bool, optional

Controls whether to return a statistic information table containing the distance between each point in the prediction set and its k nearest neighbors in the training set. If true, the table will be returned. Defaults to True.

metric : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’}, optional

Ways to compute the distance between data points. Defaults to euclidean.

minkowski_power : float, optional

When Minkowski is used for metric, this parameter controls the value of power. Only valid when metric is Minkowski. Defaults to 3.0.

algorithm : {‘brute-force’, ‘kd-tree’}, optional

Algorithm used to compute the nearest neighbors. Defaults to brute-force.

Examples

Training data:

>>> df.collect()
   ID      X1      X2  TYPE
0   0     1.0     1.0     2
1   1    10.0    10.0     3
2   2    10.0    11.0     3
3   3    10.0    10.0     3
4   4  1000.0  1000.0     1
5   5  1000.0  1001.0     1
6   6  1000.0   999.0     1
7   7   999.0   999.0     1
8   8   999.0  1000.0     1
9   9  1000.0  1000.0     1

Create KNN instance and call fit:

>>> knn = KNN(connection_context, n_neighbors=3, voting_type='majority',
...           thread_ratio=0.1, stat_info=False)
>>> knn.fit(df, 'ID', features=['X1', 'X2'], label='TYPE')
>>> pred_df = connection_context.table("PAL_KNN_CLASSDATA_TBL")

Call predict:

>>> res, stat = knn.predict(pred_df, "ID")
>>> res.collect()
   ID  TYPE
0   0     3
1   1     3
2   2     3
3   3     1
4   4     1
5   5     1
6   6     1
7   7     1

Methods

fit(data, key[, features, label]) Fit the model when given training set.
predict(data, key[, features]) Predict the class labels for the provided data
score(data, key[, features, label]) Return a scalar accuracy value after comparing the predicted and original label.
fit(data, key, features=None, label=None)

Fit the model when given training set.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(data, key, features=None)

Predict the class labels for the provided data

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
result_df : DataFrame
Predicted result, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • Label column, with same name and type as training data’s label column.
nearest_neighbors_df : DataFrame

The distance between each point in data and its k nearest neighbors in the training set. Only returned if stat_info is True. Structured as follows:

  • TEST_ + data’s ID name, with same type as data’s ID column, query data ID.
  • K, type INTEGER, K number.
  • TRAIN_ + training data’s ID name, with same type as training data’s ID column, neighbor point’s ID.
  • DISTANCE, type DOUBLE, distance.
score(data, key, features=None, label=None)

Return a scalar accuracy value after comparing the predicted and original label.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns:
accuracy : float

Scalar accuracy value after comparing the predicted label and original label.
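An illustrative call of score(), reusing the knn instance from the Examples section and assuming a labeled DataFrame df_labeled (hypothetical name) with columns ID, X1, X2 and TYPE; the returned accuracy depends on the data:

>>> acc = knn.score(df_labeled, key='ID', features=['X1', 'X2'], label='TYPE')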

hana_ml.algorithms.pal.neural_network

This module contains PAL wrappers for Multi-layer Perceptron algorithms.

The following classes are available:

class hana_ml.algorithms.pal.neural_network.MLPClassifier(conn_context, activation, output_activation, hidden_layer_size, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Classifier.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

activation : str

Activation function for the hidden layer:

  • ‘tanh’
  • ‘linear’
  • ‘sigmoid_asymmetric’
  • ‘sigmoid_symmetric’
  • ‘gaussian_asymmetric’
  • ‘gaussian_symmetric’
  • ‘elliot_asymmetric’
  • ‘elliot_symmetric’
  • ‘sin_asymmetric’
  • ‘sin_symmetric’
  • ‘cos_asymmetric’
  • ‘cos_symmetric’
  • ‘relu’
output_activation : str

Activation function for the output layer:

  • ‘tanh’
  • ‘linear’
  • ‘sigmoid_asymmetric’
  • ‘sigmoid_symmetric’
  • ‘gaussian_asymmetric’
  • ‘gaussian_symmetric’
  • ‘elliot_asymmetric’
  • ‘elliot_symmetric’
  • ‘sin_asymmetric’
  • ‘sin_symmetric’
  • ‘cos_asymmetric’
  • ‘cos_symmetric’
  • ‘relu’
hidden_layer_size : tuple of int

Size of each hidden layer.

max_iter : int, optional

Maximum number of iterations. Defaults to 100.

training_style : {‘batch’, ‘stochastic’}, optional

Specifies the training style. Defaults to stochastic.

learning_rate : float, optional

Specifies the learning rate. Only valid when training_style is stochastic.

momentum : float, optional

Specifies the momentum for gradient descent update. Only valid when training_style is stochastic.

batch_size : int, optional

Specifies the size of mini batch. Only valid when training_style is stochastic. Defaults to 1.

normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional

Defaults to no (no normalization).

weight_init : str, optional

Specifies the weight initial value.

  • ‘all-zeros’
  • ‘normal’
  • ‘uniform’
  • ‘variance-scale-normal’
  • ‘variance-scale-uniform’

Defaults to all-zeros.

categorical_variable : list of str, optional

Column names in the data table used as category variable.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

Examples

Training data:

>>> df = connection_context.table("PAL_TRAIN_MLP_REG_DATA_TBL")
>>> df.collect()
   V000  V001 V002  V003 LABEL
0     1  1.71   AC     0    AA
1    10  1.78   CA     5    AB
2    17  2.36   AA     6    AA
3    12  3.15   AA     2     C
4     7  1.05   CA     3    AB
5     6  1.50   CA     2    AB
6     9  1.97   CA     6     C
7     5  1.26   AA     1    AA
8    12  2.13   AC     4     C
9    18  1.87   AC     6    AA

Training the model:

>>> mlpc = MLPClassifier(connection_context, hidden_layer_size=(10,10),
...                      activation='TANH', output_activation='TANH',
...                      learning_rate=0.001, momentum=0.0001,
...                      training_style='stochastic',max_iter=100,
...                      normalization='z-transform', weight_init='normal',
...                      thread_ratio=0.3, categorical_variable='V003')
>>> mlpc.fit(df)

Training result may look different from the following results due to model randomness.

>>> mlpc.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  t":0.2700182926188939},{"from":13,"weight":0.0...
2          3  ht":0.2414416413305134},{"from":21,"weight":0....
>>> mlpc.train_log_.collect()
    ITERATION     ERROR
0           1  1.080261
1           2  1.008358
2           3  0.947069
3           4  0.894585
4           5  0.849411
5           6  0.810309
6           7  0.776256
7           8  0.746413
8           9  0.720093
9          10  0.696737
10         11  0.675886
11         12  0.657166
12         13  0.640270
13         14  0.624943
14         15  0.609432
15         16  0.595204
16         17  0.582101
17         18  0.569990
18         19  0.558757
19         20  0.548305
20         21  0.538553
21         22  0.529429
22         23  0.521457
23         24  0.513893
24         25  0.506704
25         26  0.499861
26         27  0.493338
27         28  0.487111
28         29  0.481159
29         30  0.475462
..        ...       ...
70         71  0.349684
71         72  0.347798
72         73  0.345954
73         74  0.344071
74         75  0.342232
75         76  0.340597
76         77  0.338837
77         78  0.337236
78         79  0.335749
79         80  0.334296
80         81  0.332759
81         82  0.331255
82         83  0.329810
83         84  0.328367
84         85  0.326952
85         86  0.325566
86         87  0.324232
87         88  0.322899
88         89  0.321593
89         90  0.320242
90         91  0.318985
91         92  0.317840
92         93  0.316630
93         94  0.315376
94         95  0.314210
95         96  0.313066
96         97  0.312021
97         98  0.310916
98         99  0.309770
99        100  0.308704

[100 rows x 2 columns]

Prediction:

>>> pred_df = connection_context.table("PAL_PREDICT_MLP_CLS_DATA_TBL")
>>> res, stat = mlpc.predict(pred_df, 'ID')

Prediction result may look different from the following results due to model randomness.

>>> res.collect()
   ID TARGET     VALUE
0   1      C  0.472751
1   2      C  0.417681
2   3      C  0.543967
>>> stat.collect()
   ID CLASS  SOFT_MAX
0   1    AA  0.371996
1   1    AB  0.155253
2   1     C  0.472751
3   2    AA  0.357822
4   2    AB  0.224496
5   2     C  0.417681
6   3    AA  0.349813
7   3    AB  0.106220
8   3     C  0.543967
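Note that the softmax values returned for each ID sum to 1 (up to rounding); for ID 1 above, 0.371996 + 0.155253 + 0.472751 = 1.0, and the TARGET/VALUE pair in the first result table is simply the class with the largest softmax value.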
Attributes:
model_ : DataFrame

Model content.

train_log_ : DataFrame

Provides mean squared error between predicted values and target values for each iteration.

Methods

fit(data[, key, features, label]) Fit the model when given training dataset.
predict(data, key[, features]) Predict using the multi-layer perceptron model.
score(data, key[, features, label]) Returns the accuracy on the given test data and labels.
fit(data, key=None, features=None, label=None)

Fit the model when given training dataset.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(data, key, features=None)

Predict using the multi-layer perceptron model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
result_df : DataFrame
Predicted classes, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • TARGET, type NVARCHAR, predicted class name.
  • VALUE, type DOUBLE, softmax value for the predicted class.
softmax_df : DataFrame
Softmax values for all classes, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • CLASS, type NVARCHAR, class name.
  • VALUE, type DOUBLE, softmax value for that class.
score(data, key, features=None, label=None)

Returns the accuracy on the given test data and labels.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns:
accuracy : float

Scalar value of accuracy after comparing the predicted result and original label.
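An illustrative call of score() (the table name is hypothetical; it assumes a labeled test table with the same layout as the training data plus an 'ID' column, and the returned accuracy depends on the data and on model randomness):

>>> df_labeled = connection_context.table("PAL_TEST_MLP_CLS_DATA_TBL")   # hypothetical table
>>> acc = mlpc.score(df_labeled, key='ID', label='LABEL')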

class hana_ml.algorithms.pal.neural_network.MLPRegressor(conn_context, activation, output_activation, hidden_layer_size, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.neural_network._MLPBase

Multi-layer perceptron (MLP) Regressor.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

activation : str

Activation function for the hidden layer:

  • ‘tanh’
  • ‘linear’
  • ‘sigmoid_asymmetric’
  • ‘sigmoid_symmetric’
  • ‘gaussian_asymmetric’
  • ‘gaussian_symmetric’
  • ‘elliot_asymmetric’
  • ‘elliot_symmetric’
  • ‘sin_asymmetric’
  • ‘sin_symmetric’
  • ‘cos_asymmetric’
  • ‘cos_symmetric’
  • ‘relu’
output_activation : str

Activation function for the output layer:

  • ‘tanh’
  • ‘linear’
  • ‘sigmoid_asymmetric’
  • ‘sigmoid_symmetric’
  • ‘gaussian_asymmetric’
  • ‘gaussian_symmetric’
  • ‘elliot_asymmetric’
  • ‘elliot_symmetric’
  • ‘sin_asymmetric’
  • ‘sin_symmetric’
  • ‘cos_asymmetric’
  • ‘cos_symmetric’
  • ‘relu’
hidden_layer_size : tuple of int

Size of each hidden layer.

max_iter : int, optional

Maximum number of iterations. Defaults to 100.

training_style : {‘batch’, ‘stochastic’}, optional

Specifies the training style. Defaults to stochastic.

learning_rate : float, optional

Specifies the learning rate. Only valid when training_style is stochastic.

momentum : float, optional

Specifies the momentum for gradient descent update. Only valid when training_style is stochastic.

batch_size : int, optional

Specifies the size of mini batch. Only valid when training_style is stochastic. Defaults to 1.

normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional

Defaults to no (no normalization).

weight_init : str, optional

Specifies the weight initial value.

  • ‘all-zeros’
  • ‘normal’
  • ‘uniform’
  • ‘variance-scale-normal’
  • ‘variance-scale-uniform’

Defaults to all-zeros.

categorical_variable : list of str, optional

Column names in the data table used as category variable.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

Examples

Training data:

>>> df = connection_context.table("PAL_TRAIN_MLP_REG_DATA_TBL")
>>> df.collect()
   V000  V001 V002  V003  T001  T002  T003
0     1  1.71   AC     0  12.7   2.8  3.06
1    10  1.78   CA     5  12.1   8.0  2.65
2    17  2.36   AA     6  10.1   2.8  3.24
3    12  3.15   AA     2  28.1   5.6  2.24
4     7  1.05   CA     3  19.8   7.1  1.98
5     6  1.50   CA     2  23.2   4.9  2.12
6     9  1.97   CA     6  24.5   4.2  1.05
7     5  1.26   AA     1  13.6   5.1  2.78
8    12  2.13   AC     4  13.2   1.9  1.34
9    18  1.87   AC     6  25.5   3.6  2.14

Training the model:

>>> mlpr = MLPRegressor(connection_context, hidden_layer_size=(10,5),
...                     activation='SIN_ASYMMETRIC',
...                     output_activation='SIN_ASYMMETRIC',
...                     learning_rate=0.001, momentum=0.00001,
...                     training_style='batch',
...                     max_iter=10000, normalization='z-transform',
...                     weight_init='normal', thread_ratio=0.3)
>>> mlpr.fit(df, label=['T001', 'T002', 'T003'])

Training result may look different from the following results due to model randomness.

>>> mlpr.model_.collect()
   ROW_INDEX                                      MODEL_CONTENT
0          1  {"CurrentVersion":"1.0","DataDictionary":[{"da...
1          2  3782583596893},{"from":10,"weight":-0.16532599...
>>> mlpr.train_log_.collect()
     ITERATION       ERROR
0            1   34.525655
1            2   82.656301
2            3   67.289241
3            4  162.768062
4            5   38.988242
5            6  142.239468
6            7   34.467742
7            8   31.050946
8            9   30.863581
9           10   30.078204
10          11   26.671436
11          12   28.078312
12          13   27.243226
13          14   26.916686
14          15   26.782915
15          16   26.724266
16          17   26.697108
17          18   26.684084
18          19   26.677713
19          20   26.674563
20          21   26.672997
21          22   26.672216
22          23   26.671826
23          24   26.671631
24          25   26.671533
25          26   26.671485
26          27   26.671460
27          28   26.671448
28          29   26.671442
29          30   26.671439
..         ...         ...
705        706   11.891081
706        707   11.891081
707        708   11.891081
708        709   11.891081
709        710   11.891081
710        711   11.891081
711        712   11.891081
712        713   11.891081
713        714   11.891081
714        715   11.891081
715        716   11.891081
716        717   11.891081
717        718   11.891081
718        719   11.891081
719        720   11.891081
720        721   11.891081
721        722   11.891081
722        723   11.891081
723        724   11.891081
724        725   11.891081
725        726   11.891081
726        727   11.891081
727        728   11.891081
728        729   11.891081
729        730   11.891081
730        731   11.891081
731        732   11.891081
732        733   11.891081
733        734   11.891081
734        735   11.891081

[735 rows x 2 columns]

>>> pred_df = connection_context.table("PAL_PREDICT_MLP_REG_DATA_TBL")
>>> pred_df.collect()
   ID  V000  V001 V002  V003
0   1     1  1.71   AC     0
1   2    10  1.78   CA     5
2   3    17  2.36   AA     6

Prediction:

>>> res  = mlpr.predict(pred_df, 'ID')

Result may look different from the following results due to model randomness.

>>> res.collect()
   ID TARGET      VALUE
0   1   T001  12.700012
1   1   T002   2.799133
2   1   T003   2.190000
3   2   T001  12.099740
4   2   T002   6.100000
5   2   T003   2.190000
6   3   T001  10.099961
7   3   T002   2.799659
8   3   T003   2.190000
Attributes:
model_ : DataFrame

Model content.

train_log_ : DataFrame

Provides mean squared error between predicted values and target values for each iteration.

Methods

fit(data[, key, features, label]) Fit the model when given training dataset.
predict(data, key[, features]) Predict using the multi-layer perceptron model.
score(data, key[, features, label]) Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)

Fit the model when given training dataset.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str or list of str, optional

Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.

predict(data, key, features=None)

Predict using the multi-layer perceptron model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Predicted results, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • TARGET, type NVARCHAR, target name.
  • VALUE, type DOUBLE, regression value.
score(data, key, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str or list of str, optional

Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.

Returns:
accuracy : float

Returns the coefficient of determination R^2 of the prediction.
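An illustrative call of score() with multiple target columns (the table name is hypothetical; it assumes a labeled test table with the same layout as the training data plus an 'ID' column, and the returned R^2 depends on the data and on model randomness):

>>> df_labeled = connection_context.table("PAL_TEST_MLP_REG_DATA_TBL")   # hypothetical table
>>> r2 = mlpr.score(df_labeled, key='ID', label=['T001', 'T002', 'T003'])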

hana_ml.algorithms.pal.preprocessing

This module contains PAL wrappers for preprocessing algorithms.

The following classes are available:

class hana_ml.algorithms.pal.preprocessing.FeatureNormalizer(conn_context, method, z_score_method=None, new_max=None, new_min=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Normalize a dataframe.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

method : {‘min-max’, ‘z-score’, ‘decimal’}
Scaling methods:
  • ‘min-max’: Min-max normalization
  • ‘z-score’: Z-Score normalization
  • ‘decimal’: Decimal scaling normalization
z_score_method : {‘mean-standard’, ‘mean-mean’, ‘median-median’}, optional
Only valid when method is ‘z-score’.
  • ‘mean-standard’: Mean-Standard deviation
  • ‘mean-mean’: Mean-Mean deviation
  • ‘median-median’: Median-Median absolute deviation
new_max : float, optional

The new maximum value for min-max normalization. Only valid when method is ‘min-max’.

new_min : float, optional

The new minimum value for min-max normalization. Only valid when method is ‘min-max’.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Does not affect transform(). Defaults to 0.

Examples

Input dataframe for training:

>>> df1.head(4).collect()
    ID    X1    X2
0    0   6.0   9.0
1    1  12.1   8.3
2    2  13.5  15.3
3    3  15.4  18.7

Creating FeatureNormalizer instance:

>>> fn = FeatureNormalizer(cc, method="min-max", new_max=1.0, new_min=0.0)

Performing fit() on given dataframe:

>>> fn.fit(df1, key='ID')
>>> fn.result_.head(4).collect()
    ID        X1        X2
0    0  0.000000  0.033175
1    1  0.186544  0.000000
2    2  0.229358  0.331754
3    3  0.287462  0.492891

Input dataframe for transforming:

>>> df2.collect()
   ID  S_X1  S_X2
0   0   6.0   9.0
1   1   6.0   7.0
2   2   4.0   4.0
3   3   1.0   2.0
4   4   9.0  -2.0
5   5   4.0   5.0

Performing transform() on given dataframe:

>>> result = fn.transform(df2, key='ID')
>>> result.collect()
   ID      S_X1      S_X2
0   0  0.000000  0.033175
1   1  0.000000 -0.061611
2   2 -0.061162 -0.203791
3   3 -0.152905 -0.298578
4   4  0.091743 -0.488152
5   5 -0.061162 -0.156398
Attributes:
result_ : DataFrame

Scaled dataset from fit() and fit_transform().

model_ :

Trained model content.

Methods

fit(data, key[, features]) Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
fit_transform(data, key[, features]) Fit with the dataset and return the results.
transform(data, key[, features]) Scales data based on the previous scaling model.
fit(data, key, features=None)

Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.

Parameters:
data : DataFrame

DataFrame to be normalized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

fit_transform(data, key, features=None)

Fit with the dataset and return the results.

Parameters:
data : DataFrame

DataFrame to be normalized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns:
DataFrame

Normalized result, with the same structure as data.
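As an illustration, fit_transform() performs the fit shown in the Examples section and returns the scaled dataset in one step; the following should reproduce the contents of result_ for df1:

>>> result = fn.fit_transform(df1, key='ID')
>>> result.head(4).collect()    # same structure as df1, with scaled X1 and X2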

transform(data, key, features=None)

Scales data based on the previous scaling model.

Parameters:
data : DataFrame

DataFrame to be normalized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

Returns:
DataFrame

Normalized result, with the same structure as data.

class hana_ml.algorithms.pal.preprocessing.KBinsDiscretizer(conn_context, strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Bin continuous data into number of intervals and perform local smoothing.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

strategy : {‘uniform_number’, ‘uniform_size’, ‘quantile’, ‘sd’}
Binning methods:
  • ‘uniform_number’: Equal widths based on the number of bins.
  • ‘uniform_size’: Equal widths based on the bin size.
  • ‘quantile’: Equal number of records per bin.
  • ‘sd’: Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.
smoothing : {‘means’, ‘medians’, ‘boundaries’}
Smoothing methods:
  • ‘means’: Each value within a bin is replaced by the average of all the values belonging to the same bin.
  • ‘medians’: Each value in a bin is replaced by the median of all the values belonging to the same bin.
  • ‘boundaries’: The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When the distance to both boundaries is equal, the value is replaced by the front boundary value.

Values used for smoothing are not re-calculated during transform().

n_bins : int, optional

The number of bins. Only valid when strategy is ‘uniform_number’ or ‘quantile’. Defaults to 2.

bin_size : int, optional

The interval width of each bin. Only valid when strategy is ‘uniform_size’. Defaults to 10.

n_sd : int, optional

The leftmost bin contains all values located further than n_sd standard deviations lower than the mean, and the rightmost bin contains all values located further than n_sd standard deviations above the mean. Only valid when strategy is ‘sd’. Defaults to 1.

Examples

Input dataframe for fitting:

>>> df1.collect()
    ID  DATA
0    0   6.0
1    1  12.0
2    2  13.0
3    3  15.0
4    4  10.0
5    5  23.0
6    6  24.0
7    7  30.0
8    8  32.0
9    9  25.0
10  10  38.0

Creating KBinsDiscretizer instance:

>>> binning = KBinsDiscretizer(cc, strategy='uniform_size',
...                          smoothing='means',
...                          bin_size=10)

Performing fit() on the given dataframe:

>>> binning.fit(df1, key='ID')
>>> binning.result_.collect()
    ID  BIN_INDEX       DATA
0    0          1   8.000000
1    1          2  13.333333
2    2          2  13.333333
3    3          2  13.333333
4    4          1   8.000000
5    5          3  25.500000
6    6          3  25.500000
7    7          3  25.500000
8    8          4  35.000000
9    9          3  25.500000
10  10          4  35.000000
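For reference, with smoothing='means' each smoothed DATA value above is simply the mean of its bin: bin 3, for example, contains the values 23.0, 24.0, 25.0 and 30.0, whose mean is 25.5, and bin 4 contains 32.0 and 38.0, whose mean is 35.0.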

Input dataframe for transforming:

>>> df2.collect()
   ID  DATA
0   0   6.0
1   1  67.0
2   2   4.0
3   3  12.0
4   4  -2.0
5   5  40.0

Performing transform() on the given dataframe:

>>> result = binning.transform(df2, key='ID')
>>> result.collect()
   ID  BIN_INDEX       DATA
0   0          1   8.000000
1   1         -1  67.000000
2   2          1   8.000000
3   3          2  13.333333
4   4          1   8.000000
5   5          4  35.000000
Attributes:
result_ : DataFrame

Binned dataset from fit() and fit_transform().

model_ : DataFrame

Binning model content.

Methods

fit(data, key[, features]) Bin input data into number of intervals and smooth.
fit_transform(data, key[, features]) Fit with the dataset and return the results.
transform(data, key[, features]) Bin data based on the previous binning model.
fit(data, key, features=None)

Bin input data into number of intervals and smooth.

Parameters:
data : DataFrame

DataFrame to be discretized.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

fit_transform(data, key, features=None)

Fit with the dataset and return the results.

Parameters:
data : DataFrame

DataFrame to be binned.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns:
DataFrame
Binned result, structured as follows:
  • DATA_ID column, with same name and type as data’s ID column.
  • BIN_INDEX, type INTEGER, assigned bin index.
  • BINNING_DATA column, smoothed value, with same name and type as data’s feature column.
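
An illustrative sketch of fit_transform(), reusing df1 from the Examples above; the 'quantile' strategy and parameter values here are assumptions for demonstration, and the resulting bin assignments depend on the data:

>>> binning2 = KBinsDiscretizer(cc, strategy='quantile',
...                             smoothing='medians', n_bins=3)
>>> binned = binning2.fit_transform(df1, key='ID')
>>> binned.collect()  # ID, BIN_INDEX and smoothed DATA columns, as described above
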
transform(data, key, features=None)

Bin data based on the previous binning model.

Parameters:
data : DataFrame

DataFrame to be binned.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_BINNING_ASSIGNMENT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns:
DataFrame
Binned result, structured as follows:
  • DATA_ID column, with same name and type as data’s ID column.
  • BIN_INDEX, type INTEGER, assigned bin index.
  • BINNING_DATA column, smoothed value, with same name and type as data’s feature column.

hana_ml.algorithms.pal.regression

This module contains wrappers for PAL regression algorithms.

The following classes are available:

class hana_ml.algorithms.pal.regression.GLM(conn_context, family=None, link=None, solver=None, handle_missing_fit=None, quasilikelihood=None, max_iter=None, tol=None, significance_level=None, output_fitted=None, alpha=None, num_lambda=None, lambda_min_ratio=None, categorical_variable=None, ordering=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

Regression by a generalized linear model, based on PAL_GLM. Also supports ordinal regression.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

family : str, optional

The kind of distribution the dependent variable outcomes are assumed to be drawn from. Must be one of the following:

  • ‘gaussian’
  • ‘normal’ (synonym of ‘gaussian’)
  • ‘poisson’
  • ‘binomial’
  • ‘gamma’
  • ‘inversegaussian’
  • ‘negativebinomial’
  • ‘ordinal’ (for ordinal regression)

Defaults to ‘gaussian’.

link : str, optional

GLM link function. Determines the relationship between the linear predictor and the predicted response. Default and allowed values depend on family. ‘inverse’ is accepted as a synonym of ‘reciprocal’.

family             default link     allowed values of link
gaussian           identity         identity, log, reciprocal
poisson            log              identity, log
binomial           logit            logit, probit, comploglog, log
gamma              reciprocal       identity, reciprocal, log
inversegaussian    inversesquare    inversesquare, identity, reciprocal, log
negativebinomial   log              identity, log, sqrt
ordinal            logit            logit, probit, comploglog

solver : {‘irls’, ‘nr’, ‘cd’}, optional

Optimization algorithm to use.

  • ‘irls’: Iteratively re-weighted least squares.
  • ‘nr’: Newton-Raphson.
  • ‘cd’: Coordinate descent. (Picking coordinate descent activates elastic net regularization.)

Defaults to ‘irls’, except when family is ‘ordinal’. Ordinal regression requires (and defaults to) ‘nr’, and Newton-Raphson is not supported for other values of family.

handle_missing_fit : {‘skip’, ‘abort’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values during fitting.

  • ‘skip’: Don’t use those rows for fitting.
  • ‘abort’: Throw an error if missing independent variable values are found.
  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

quasilikelihood : bool, optional

If True, enables the use of quasi-likelihood to estimate overdispersion. Defaults to False.

max_iter : int, optional

Maximum number of optimization iterations. Defaults to 100 for IRLS and Newton-Raphson. Defaults to 100000 for coordinate descent.

tol : float, optional

Stopping condition for optimization. Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.

significance_level : float, optional

Significance level for confidence intervals and prediction intervals. Defaults to 0.05.

output_fitted : bool, optional

If True, create the fitted_ DataFrame of fitted response values for training data in fit.

alpha : float, optional

Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive. Defaults to 1.0.

num_lambda : int, optional

The number of lambda values. Only accepted when using coordinate descent. Defaults to 100.

lambda_min_ratio : float, optional

The smallest value of lambda, as a fraction of the maximum lambda, where lambda_max is the smallest value for which all coefficients are zero. Only accepted when using coordinate descent. Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.

ordering : list of str or list of int, optional

Specifies the order of categories for ordinal regression. The default is numeric order for ints and alphabetical order for strings.

Examples

Training data:

>>> df.collect()
   ID  Y  X
0   1  0 -1
1   2  0 -1
2   3  1  0
3   4  1  0
4   5  1  0
5   6  1  0
6   7  2  1
7   8  2  1
8   9  2  1

Fitting a GLM on that data:

>>> glm = GLM(cc, solver='irls', family='poisson', link='log')
>>> glm.fit(df, key='ID', label='Y')

Performing prediction:

>>> df2.collect()
   ID  X
0   1 -1
1   2  0
2   3  1
3   4  2
>>> glm.predict(df2, key='ID')[['ID', 'PREDICTION']].collect()
   ID           PREDICTION
0   1  0.25543735346197155
1   2    0.744562646538029
2   3   2.1702915689746476
3   4     6.32608352871737
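
For ordinal regression, a minimal constructor sketch (the category labels 'low', 'medium' and 'high' are hypothetical; the solver defaults to 'nr' for this family):

>>> glm_ord = GLM(cc, family='ordinal', link='logit',
...               ordering=['low', 'medium', 'high'])
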
Attributes:
statistics_ : DataFrame

Training statistics and model information other than the coefficients and covariance matrix.

coef_ : DataFrame

Model coefficients.

covmat_ : DataFrame

Covariance matrix. Set to None for coordinate descent.

fitted_ : DataFrame

Predicted values for the training data. Set to None if output_fitted is False.

Methods

fit(data[, key, features, label]) Fit a generalized linear model based on training data.
predict(data, key[, features, …]) Predict dependent variable values based on fitted model.
score(data, key[, features, label, …]) Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)

Fit a generalized linear model based on training data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column. Required when output_fitted is True.

features : list of str, optional

Names of the feature columns. Defaults to all non-ID, non-label columns.

label : str or list of str, optional

Name of the dependent variable. Defaults to the last column. (This is not the PAL default.) When family is ‘binomial’, label may be either a single column name or a list of two column names.

predict(data, key, features=None, prediction_type=None, significance_level=None, handle_missing=None)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Defaults to all non-ID columns.

prediction_type : {‘response’, ‘link’}, optional

Specifies whether to output predicted values of the response or the link function. Defaults to ‘response’.

significance_level : float, optional

Significance level for confidence intervals and prediction intervals. If specified, overrides the value passed to the GLM constructor.

handle_missing : {‘skip’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values.

  • ‘skip’: Don’t perform prediction for those rows.
  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

Returns:
DataFrame

Predicted values, structured as follows. The following two columns are always populated:

  • ID column, with same name and type as data’s ID column.
  • PREDICTION, type NVARCHAR(100), representing predicted values.
The following five columns are only populated for IRLS:
  • SE, type DOUBLE. Standard error, or for ordinal regression, the probability that the data point belongs to the predicted category.
  • CI_LOWER, type DOUBLE. Lower bound of the confidence interval.
  • CI_UPPER, type DOUBLE. Upper bound of the confidence interval.
  • PI_LOWER, type DOUBLE. Lower bound of the prediction interval.
  • PI_UPPER, type DOUBLE. Upper bound of the prediction interval.
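
A hedged sketch of predict() with its optional parameters, reusing glm and df2 from the Examples above (output values not shown):

>>> glm.predict(df2, key='ID', prediction_type='link',
...             significance_level=0.1,
...             handle_missing='fill_zero').collect()
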
score(data, key, features=None, label=None, prediction_type=None, handle_missing=None)

Returns the coefficient of determination R^2 of the prediction.

Not applicable for ordinal regression.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column. (This is not the PAL default.) Cannot be two columns, even for family=’binomial’.

prediction_type : {‘response’, ‘link’}, optional

Specifies whether to predict the value of the response or the link function. The contents of the label column should match this choice. Defaults to ‘response’.

handle_missing : {‘skip’, ‘fill_zero’}, optional

How to handle data rows with missing independent variable values.

  • ‘skip’: Don’t perform prediction for those rows. Those rows will be left out of the R^2 computation.
  • ‘fill_zero’: Replace missing values with 0.

Defaults to ‘skip’.

Returns:
accuracy : float

The coefficient of determination R^2 of the prediction on the given data.
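
A minimal sketch of score(), reusing the training data from the Examples above purely for illustration; the return value is a float:

>>> glm.score(df, key='ID', label='Y')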

class hana_ml.algorithms.pal.regression.PolynomialRegression(conn_context, degree, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)

Bases: hana_ml.algorithms.pal.pal_base.PALBase

A univariate polynomial regression model, based on PAL_POLYNOMIAL_REGRESSION.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

degree : int

Degree of the polynomial model.

decomposition : {‘LU’, ‘SVD’}, optional
Matrix factorization type to use. Case-insensitive.
  • ‘LU’: LU decomposition.
  • ‘SVD’: singular value decomposition.

Defaults to LU decomposition.

adjusted_r2 : boolean, optional

If true, include the adjusted R^2 value in the statistics table. Defaults to False.

pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional

Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.

  • ‘no’ or not provided: No PMML model.
  • ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
  • ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.

Prediction does not require a PMML model.

thread_ratio : float, optional

Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Does not affect fitting. Defaults to 0.

Examples

Training data (based on y = x^3 - 2x^2 + 3x + 5, with noise):

>>> df.collect()
   ID    X       Y
0   1  0.0   5.048
1   2  1.0   7.045
2   3  2.0  11.003
3   4  3.0  23.072
4   5  4.0  49.041

Training the model:

>>> pr = PolynomialRegression(cc, degree=3)
>>> pr.fit(df, key='ID')

Prediction:

>>> df2.collect()
   ID    X
0   1  0.5
1   2  1.5
2   3  2.5
3   4  3.5
>>> pr.predict(df2, key='ID').collect()
   ID      VALUE
0   1   6.157063
1   2   8.401269
2   3  15.668581
3   4  33.928501

Ideal output:

>>> df2.select('ID', ('POWER(X, 3)-2*POWER(X, 2)+3*x+5', 'Y')).collect()
   ID       Y
0   1   6.125
1   2   8.375
2   3  15.625
3   4  33.875
Attributes:
coefficients_ : DataFrame

Fitted regression coefficients.

pmml_ : DataFrame

PMML model. Set to None if no PMML model was requested.

fitted_ : DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

statistics_ : DataFrame

Regression-related statistics, such as mean squared error.

Methods

fit(data[, key, features, label]) Fit regression model based on training data.
predict(data, key[, features]) Predict dependent variable values based on fitted model.
score(data, key[, features, label]) Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)

Fit regression model based on training data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

label : str, optional

Name of the dependent variable. Defaults to the last column. (This is not the PAL default.)

predict(data, key, features=None)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values used for prediction.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.

Returns:
DataFrame
Predicted values, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • VALUE, type DOUBLE, representing predicted values.

Notes

predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
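
A hedged sketch of how a PMML model could be requested and inspected, reusing df from the Examples above:

>>> pr_pmml = PolynomialRegression(cc, degree=3, pmml_export='multi-row')
>>> pr_pmml.fit(df, key='ID')
>>> pr_pmml.pmml_.collect()  # PMML representation, possibly split over several rows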

score(data, key, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION_PREDICT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.

label : str, optional

Name of the dependent variable. Defaults to the last column. (This is not the PAL default.)

Returns:
accuracy : float

The coefficient of determination R^2 of the prediction on the given data.
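
A minimal sketch of score(), reusing the training data from the Examples above for illustration only; label defaults to the last column ('Y'):

>>> pr.score(df, key='ID')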

hana_ml.algorithms.pal.stats

This module contains PAL wrappers for statistics algorithms.

The following functions are available:

hana_ml.algorithms.pal.stats.chi_squared_goodness_of_fit(conn_context, data, key, observed_data=None, expected_freq=None)

Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.

Parameters:
conn_context : ConnectionContext

Database connection object.

data : DataFrame

Input data.

key : str

Name of the ID column.

observed_data : str, optional

Name of column for counts of actual observations belonging to each category. If not given, the input dataframe must only have three columns. The first of the non-ID columns will be observed_data.

expected_freq : str, optional

Name of the expected frequency column. If not given, the input dataframe must only have three columns. The second of the non-ID columns will be expected_freq.

Returns:
count_comparison_df : DataFrame

Comparison between the actual counts and the expected counts, structured as follows:

  • ID column, with same name and type as data’s ID column.
  • Observed data column, with same name as data’s observed_data column, but always with type DOUBLE.
  • EXPECTED, type DOUBLE, expected count in each category.
  • RESIDUAL, type DOUBLE, the difference between the observed counts and the expected counts.
stat_df : DataFrame

Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:

  • STAT_NAME, type NVARCHAR(100), name of statistics.
  • STAT_VALUE, type DOUBLE, value of statistics.

Examples

Data to test:

>>> df = cc.table('PAL_CHISQTESTFIT_DATA_TBL')
>>> df.collect()
   ID  OBSERVED    P
0   0     519.0  0.3
1   1     364.0  0.2
2   2     363.0  0.2
3   3     200.0  0.1
4   4     212.0  0.1
5   5     193.0  0.1

Perform chi_squared_goodness_of_fit:

>>> res, stat = chi_squared_goodness_of_fit(cc, df, 'ID')
>>> res.collect()
   ID  OBSERVED  EXPECTED  RESIDUAL
0   0     519.0     555.3     -36.3
1   1     364.0     370.2      -6.2
2   2     363.0     370.2      -7.2
3   3     200.0     185.1      14.9
4   4     212.0     185.1      26.9
5   5     193.0     185.1       7.9
>>> stat.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.062669
1  degree of freedom    5.000000
2            p-value    0.152815
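
The observed and expected-frequency columns can also be named explicitly; a sketch using the columns of the example table above:

>>> res, stat = chi_squared_goodness_of_fit(cc, df, 'ID',
...                                         observed_data='OBSERVED',
...                                         expected_freq='P')
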
hana_ml.algorithms.pal.stats.chi_squared_independence(conn_context, data, key, observed_data=None, correction=False)

Perform the chi-squared test of independence to tell whether observations of two variables are independent from each other.

Parameters:
conn_context : ConnectionContext

Database connection object.

data : DataFrame

Input data.

key : str

Name of the ID column.

observed_data : list of str, optional

Names of the observed data columns. If not given, it defaults to all the non-ID columns.

correction : bool, optional

If True, and the degrees of freedom is 1, apply Yates’s correction for continuity. The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value. Defaults to False.

Returns:
expected_count_df : DataFrame
The expected count table, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • Expected count columns, named by prepending Expected_ to each observed_data column name, type DOUBLE. There will be as many columns here as there are observed_data columns.
stat_df : DataFrame

Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:

  • STAT_NAME, type NVARCHAR(100), name of statistics.
  • STAT_VALUE, type DOUBLE, value of statistics.

Examples

Data to test:

>>> df = cc.table('PAL_CHISQTESTIND_DATA_TBL')
>>> df.collect()
       ID  X1    X2  X3    X4
0    male  25  23.0  11  14.0
1  female  41  20.0  18   6.0

Perform chi-squared test of independence:

>>> res, stats = chi_squared_independence(cc, df, 'ID')
>>> res.collect()
       ID  EXPECTED_X1  EXPECTED_X2  EXPECTED_X3  EXPECTED_X4
0    male    30.493671    19.867089    13.398734     9.240506
1  female    35.506329    23.132911    15.601266    10.759494
>>> stats.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.113152
1  degree of freedom    3.000000
2            p-value    0.043730
hana_ml.algorithms.pal.stats.covariance_matrix(conn_context, data, cols=None)

Computes the covariance matrix.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

Input data.

cols : list of str, optional

List of column names to analyze. If ‘cols’ is not provided, it defaults to all columns.

Returns:
covariance_matrix : DataFrame
Covariance between any two data samples (columns).
  • ID, type NVARCHAR. The values of this column are the column names from cols.
  • Covariance columns, type DOUBLE, named after the columns in cols. The covariance between variables X and Y is in column X, in the row with ID value Y.

Examples

Dataset to be analyzed:

>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8

Compute the covariance matrix:

>>> result = covariance_matrix(conn, df)

Outputs:

>>> result.collect()
  ID          X           Y
0  X  31.866667   44.473333
1  Y  44.473333  176.677667
hana_ml.algorithms.pal.stats.f_oneway(conn_context, data, group=None, sample=None, multcomp_method=None, significance_level=None)

Performs a 1-way ANOVA.

The purpose of one-way ANOVA is to determine whether there is any statistically significant difference between the means of three or more independent groups.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

Input data.

group : str, optional

Name of the group column. If group is not provided, defaults to the first column.

sample : str, optional

Name of the sample measurement column. If sample is not provided, data must have exactly 1 non-group column and sample defaults to that column.

multcomp_method : str, optional

Method used to perform multiple comparison tests. Should be one of the following:

  • ‘tukey-kramer’
  • ‘bonferroni’
  • ‘dunn-sidak’
  • ‘scheffe’
  • ‘fisher-lsd’

Defaults to tukey-kramer.

significance_level : float, optional

The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1. Defaults to 0.05.

Returns:
statistics_df : DataFrame
Statistics for each group, structured as follows:
  • GROUP, type NVARCHAR(256), group name.
  • VALID_SAMPLES, type INTEGER, number of valid samples.
  • MEAN, type DOUBLE, group mean.
  • SD, type DOUBLE, group standard deviation.
ANOVA_df : DataFrame
Computed results for ANOVA, structured as follows:
  • VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, including between groups, within groups (error) and total.
  • SUM_OF_SQUARES, type DOUBLE, sum of squares.
  • DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.
  • MEAN_SQUARES, type DOUBLE, mean squares.
  • F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.
  • P_VALUE, type DOUBLE, associated p-value from the F-distribution.
multiple_comparison_df : DataFrame
Multiple comparison results, structured as follows:
  • FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.
  • SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.
  • MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.
  • SE, type DOUBLE, standard error computed from all data.
  • P_VALUE, type DOUBLE, p-value.
  • CI_LOWER, type DOUBLE, the lower limit of the confidence interval.
  • CI_UPPER, type DOUBLE, the upper limit of the confidence interval.

Examples

Samples for One Way ANOVA test:

>>> df.collect()
   GROUP  DATA
0      A   4.0
1      A   5.0
2      A   4.0
3      A   3.0
4      A   2.0
5      A   4.0
6      A   3.0
7      A   4.0
8      B   6.0
9      B   8.0
10     B   4.0
11     B   5.0
12     B   4.0
13     B   6.0
14     B   5.0
15     B   8.0
16     C   6.0
17     C   7.0
18     C   6.0
19     C   6.0
20     C   7.0
21     C   5.0

Perform one-way ANOVA test:

>>> stats, anova, mult_comp= f_oneway(conn, df,
...                                   multcomp_method='Tukey-Kramer',
...                                   significance_level=0.05)

Outputs:

>>> stats.collect()
   GROUP  VALID_SAMPLES      MEAN        SD
0      A              8  3.625000  0.916125
1      B              8  5.750000  1.581139
2      C              6  6.166667  0.752773
3  Total             22  5.090909  1.600866
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES  \
0              Group       27.609848                 2.0     13.804924
1              Error       26.208333                19.0      1.379386
2              Total       53.818182                21.0           NaN
     F_RATIO   P_VALUE
0  10.008021  0.001075
1        NaN       NaN
2        NaN       NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  \
0           A            B        -2.125000  0.587236  0.004960 -3.616845
1           A            C        -2.541667  0.634288  0.002077 -4.153043
2           B            C        -0.416667  0.634288  0.790765 -2.028043
   CI_UPPER
0 -0.633155
1 -0.930290
2  1.194710
hana_ml.algorithms.pal.stats.f_oneway_repeated(conn_context, data, subject_id, measures=None, multcomp_method=None, significance_level=None, se_type=None)

Performs one-way repeated measures analysis of variance, along with Mauchly’s Test of Sphericity and post hoc multiple comparison tests.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

Input data.

subject_id : str

Name of the subject ID column. The algorithm treats each row of the data table as a different subject. Hence there should be no duplicate subject IDs in this column.

measures : list of str, optional

Names of the groups (measures). If measures is not provided, defaults to all non-subject_id columns.

multcomp_method : str, optional

Method used to perform multiple comparison tests. Should be one of the following:

  • ‘tukey-kramer’
  • ‘bonferroni’
  • ‘dunn-sidak’
  • ‘scheffe’
  • ‘fisher-lsd’

Defaults to bonferroni.

significance_level : float, optional

The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1. Defaults to 0.05.

se_type : {‘all-data’, ‘two-group’}, optional
Type of standard error used in multiple comparison tests.
  • ‘all-data’: computes the standard error from all data. It has more power if the assumption of sphericity is true, especially with small data sets.
  • ‘two-group’: computes the standard error from only the two groups being compared. It doesn’t assume sphericity.

Defaults to two-group.

Returns:
statistics_df : DataFrame
Statistics for each group, structured as follows:
  • GROUP, type NVARCHAR(256), group name.
  • VALID_SAMPLES, type INTEGER, number of valid samples.
  • MEAN, type DOUBLE, group mean.
  • SD, type DOUBLE, group standard deviation.
Mauchly_test_df : DataFrame
Mauchly test results, structured as follows:
  • STAT_NAME, type NVARCHAR(100), names of test result quantities.
  • STAT_VALUE, type DOUBLE, values of test result quantities.
ANOVA_df : DataFrame
Computed results for ANOVA, structured as follows:
  • VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, divided into group, error and subject portions.
  • SUM_OF_SQUARES, type DOUBLE, sum of squares.
  • DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.
  • MEAN_SQUARES, type DOUBLE, mean squares.
  • F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.
  • P_VALUE, type DOUBLE, associated p-value from the F-distribution.
  • P_VALUE_GG, type DOUBLE, p-value of Greenhouse-Geisser correction.
  • P_VALUE_HF, type DOUBLE, p-value of Huynh-Feldt correction.
  • P_VALUE_LB, type DOUBLE, p-value of lower bound correction.
multiple_comparison_df : DataFrame
Multiple comparison results, structured as follows:
  • FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.
  • SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.
  • MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.
  • SE, type DOUBLE, standard error computed from all data or compared two groups, depending on se_type.
  • P_VALUE, type DOUBLE, p-value.
  • CI_LOWER, type DOUBLE, the lower limit of the confidence interval.
  • CI_UPPER, type DOUBLE, the upper limit of the confidence interval.

Examples

Samples for One Way Repeated ANOVA test:

>>> df.collect()
  ID  MEASURE1  MEASURE2  MEASURE3  MEASURE4
0  1       8.0       7.0       1.0       6.0
1  2       9.0       5.0       2.0       5.0
2  3       6.0       2.0       3.0       8.0
3  4       5.0       3.0       1.0       9.0
4  5       8.0       4.0       5.0       8.0
5  6       7.0       5.0       6.0       7.0
6  7      10.0       2.0       7.0       2.0
7  8      12.0       6.0       8.0       1.0

Perform one-way repeated measures ANOVA test:

>>> stats, mtest, anova, mult_comp = f_oneway_repeated(
...     conn,
...     df,
...     subject_id='ID',
...     multcomp_method='bonferroni',
...     significance_level=0.05,
...     se_type='two-group')

Outputs:

>>> stats.collect()
      GROUP  VALID_SAMPLES   MEAN        SD
0  MEASURE1              8  8.125  2.232071
1  MEASURE2              8  4.250  1.832251
2  MEASURE3              8  4.125  2.748376
3  MEASURE4              8  5.750  2.915476
>>> mtest.collect()
                    STAT_NAME  STAT_VALUE
0                 Mauchly's W    0.136248
1                  Chi-Square   11.405981
2                          df    5.000000
3                      pValue    0.046773
4  Greenhouse-Geisser Epsilon    0.532846
5         Huynh-Feldt Epsilon    0.665764
6         Lower bound Epsilon    0.333333
>>> anova.collect()
  VARIABILITY_SOURCE  SUM_OF_SQUARES  DEGREES_OF_FREEDOM  MEAN_SQUARES  \
0              Group          83.125                 3.0     27.708333
1            Subject          17.375                 7.0      2.482143
2              Error         153.375                21.0      7.303571
    F_RATIO  P_VALUE  P_VALUE_GG  P_VALUE_HF  P_VALUE_LB
0  3.793806  0.02557    0.062584    0.048331    0.092471
1       NaN      NaN         NaN         NaN         NaN
2       NaN      NaN         NaN         NaN         NaN
>>> mult_comp.collect()
  FIRST_GROUP SECOND_GROUP  MEAN_DIFFERENCE        SE   P_VALUE  CI_LOWER  \
0    MEASURE1     MEASURE2            3.875  0.811469  0.012140  0.924655
1    MEASURE1     MEASURE3            4.000  0.731925  0.005645  1.338861
2    MEASURE1     MEASURE4            2.375  1.792220  1.000000 -4.141168
3    MEASURE2     MEASURE3            0.125  1.201747  1.000000 -4.244322
4    MEASURE2     MEASURE4           -1.500  1.336306  1.000000 -6.358552
5    MEASURE3     MEASURE4           -1.625  1.821866  1.000000 -8.248955
   CI_UPPER
0  6.825345
1  6.661139
2  8.891168
3  4.494322
4  3.358552
5  4.998955
hana_ml.algorithms.pal.stats.pearsonr_matrix(conn_context, data, cols=None)

Computes a correlation matrix using Pearson’s correlation coefficient.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

Input data.

cols : list of str, optional

List of column names to analyze. If ‘cols’ is not provided, it defaults to all columns.

Returns:
pearsonr_matrix : DataFrame

Pearson’s correlation coefficient between any two data samples (columns).

  • ID, type NVARCHAR. The values of this column are the column names from cols.
  • Correlation coefficient columns, type DOUBLE, named after the columns in cols. The correlation coefficient between variables X and Y is in column X, in the row with ID value Y.

Examples

Dataset to be analyzed:

>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8

Compute the Pearson’s correlation coefficient matrix:

>>> result = pearsonr_matrix(conn, df)

Outputs:

>>> result.collect()
  ID               X               Y
0  X               1  0.592707653621
1  Y  0.592707653621               1
hana_ml.algorithms.pal.stats.ttest_1samp(conn_context, data, col=None, mu=0, test_type='two_sides', conf_level=0.95)

Perform the t-test to determine whether a sample of observations could have been generated by a process with a specific mean.

Parameters:
conn_context : ConnectionContext

Database connection object.

data : DataFrame

DataFrame containing the data.

col : str, optional

Name of the column for sample. If not given, the input dataframe must only have one column.

mu : float, optional

Hypothesized mean of the population underlying the sample. Default value: 0

test_type : string, optional
The alternative hypothesis type.
  • ‘two_sides’
  • ‘less’
  • ‘greater’

Default value: two_sides

conf_level : float, optional

Confidence level for alternative hypothesis confidence interval. Default value: 0.95

Returns:
stat_df : DataFrame

DataFrame containing the statistics results from the t-test.

Examples

Original data:

>>> df.collect()
    X1
0  1.0
1  2.0
2  4.0
3  7.0
4  3.0

Perform One Sample T-Test

>>> ttest_1samp(conn, df).collect()
           STAT_NAME  STAT_VALUE
0            t-value    3.302372
1  degree of freedom    4.000000
2            p-value    0.029867
3      _PAL_MEAN_X1_    3.400000
4   confidence level    0.950000
5         lowerLimit    0.541475
6         upperLimit    6.258525
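
A hedged sketch of a one-sided test against a non-zero hypothesized mean, reusing df from above (output not shown):

>>> ttest_1samp(conn, df, col='X1', mu=3,
...             test_type='greater', conf_level=0.9).collect()
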
hana_ml.algorithms.pal.stats.ttest_ind(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', var_equal=False, conf_level=0.95)

Perform the T-test for the mean difference of two independent samples.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

DataFrame containing the data.

col1 : str, optional

Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of the columns will be col1.

col2 : str, optional

Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the columns will be col2.

mu : float, optional

Hypothesized difference between the two underlying population means. Default value: 0

test_type : string, optional
The alternative hypothesis type.
  • ‘two_sides’
  • ‘less’
  • ‘greater’

Default value: two_sides

var_equal : bool, optional

Controls whether to assume that the two samples have equal variance. Default value: False

conf_level : float, optional

Confidence level for alternative hypothesis confidence interval. Default value: 0.95

Returns:
stat_df : DataFrame

DataFrame containing the statistics results from the t-test.

Examples

Original data:

>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  NaN  10.0

Perform Independent Sample T-Test

>>> ttest_ind(conn, df).collect()
           STAT_NAME  STAT_VALUE
0            t-value   -5.013774
1  degree of freedom    5.649757
2            p-value    0.002875
3      _PAL_MEAN_X1_    3.500000
4      _PAL_MEAN_X2_   11.600000
5   confidence level    0.950000
6         lowerLimit  -12.113278
7         upperLimit   -4.086722
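
A sketch of the same test under the equal-variance assumption, reusing df from above (output not shown):

>>> ttest_ind(conn, df, col1='X1', col2='X2', var_equal=True).collect()
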
hana_ml.algorithms.pal.stats.ttest_paired(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', conf_level=0.95)

Perform the t-test for the mean difference of two sets of paired samples.

Parameters:
conn_context : ConnectionContext

Database connection object.

data : DataFrame

DataFrame containing the data.

col1 : str, optional

Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of two columns will be col1.

col2 : str, optional

Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the two columns will be col2.

mu : float, optional

Hypothesized difference between two underlying population means. Default value: 0

test_type : string, optional
The alternative hypothesis type.
  • ‘two_sides’
  • ‘less’
  • ‘greater’

Default value: two_sides

conf_level : float, optional

Confidence level for alternative hypothesis confidence interval. Default value: 0.95

Returns:
stat_df : DataFrame
DataFrame containing the statistics results from the t-test.

Examples

Original data:

>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  3.0  10.0

Perform Paired Sample T-Test

>>> ttest_paired(conn, df).collect()
                STAT_NAME  STAT_VALUE
0                 t-value  -14.062884
1       degree of freedom    4.000000
2                 p-value    0.000148
3  _PAL_MEAN_DIFFERENCES_   -8.200000
4        confidence level    0.950000
5              lowerLimit   -9.818932
6              upperLimit   -6.581068
hana_ml.algorithms.pal.stats.univariate_analysis(conn_context, data, key=None, cols=None, categorical_variable=None, significance_level=None, trimmed_percentage=None)

Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

data : DataFrame

Input data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

cols : list of str, optional

List of column names to analyze. If cols is not provided, it defaults to all non-ID columns.

categorical_variable : list of str, optional

INTEGER columns specified in this list will be treated as categorical data. By default, INTEGER columns are treated as continuous.

significance_level : float, optional

The significance level when the function calculates the confidence interval of the sample mean. Values must be greater than 0 and less than 1. Defaults to 0.05.

trimmed_percentage : float, optional

The ratio of data at both head and tail that will be dropped in the process of calculating the trimmed mean. Value range is from 0 to 0.5. Defaults to 0.05.

Returns:
continuous_result : DataFrame
Statistics for continuous variables, structured as follows:
  • VARIABLE_NAME, type NVARCHAR(256), variable names.
  • STAT_NAME, type NVARCHAR(100), names of statistical quantities, including the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis (14 quantities in total).
  • STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
categorical_result : DataFrame
Statistics for categorical variables, structured as follows:
  • VARIABLE_NAME, type NVARCHAR(256), variable names.
  • CATEGORY, type NVARCHAR(256), category names of the corresponding variables. Null is also treated as a category.
  • STAT_NAME, type NVARCHAR(100), names of statistical quantities: number of observations, percentage of total data points falling in the current category for a variable (including null).
  • STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.

Examples

Dataset to be analyzed:

>>> df.collect()
      X1    X2  X3 X4
0    1.2  None   1  A
1    2.5  None   2  C
2    5.2  None   3  A
3  -10.2  None   2  A
4    8.5  None   2  C
5  100.0  None   3  B

Perform univariate analysis:

>>> continuous, categorical = univariate_analysis(
...     conn,
...     df,
...     categorical_variable=['X3'],
...     significance_level=0.05,
...     trimmed_percentage=0.2)

Outputs:

>>> continuous.collect()
   VARIABLE_NAME                 STAT_NAME   STAT_VALUE
0             X1        valid observations     6.000000
1             X1                       min   -10.200000
2             X1            lower quartile     1.200000
3             X1                    median     3.850000
4             X1            upper quartile     8.500000
5             X1                       max   100.000000
6             X1                      mean    17.866667
7             X1  CI for mean, lower bound   -24.879549
8             X1  CI for mean, upper bound    60.612883
9             X1              trimmed mean     4.350000
10            X1                  variance  1659.142667
11            X1        standard deviation    40.732575
12            X1                  skewness     1.688495
13            X1                  kurtosis     1.036148
14            X2        valid observations     0.000000
>>> categorical.collect()
   VARIABLE_NAME      CATEGORY      STAT_NAME  STAT_VALUE
0             X3  __PAL_NULL__          count    0.000000
1             X3  __PAL_NULL__  percentage(%)    0.000000
2             X3             1          count    1.000000
3             X3             1  percentage(%)   16.666667
4             X3             2          count    3.000000
5             X3             2  percentage(%)   50.000000
6             X3             3          count    2.000000
7             X3             3  percentage(%)   33.333333
8             X4  __PAL_NULL__          count    0.000000
9             X4  __PAL_NULL__  percentage(%)    0.000000
10            X4             A          count    3.000000
11            X4             A  percentage(%)   50.000000
12            X4             B          count    1.000000
13            X4             B  percentage(%)   16.666667
14            X4             C          count    2.000000
15            X4             C  percentage(%)   33.333333

hana_ml.algorithms.pal.svm

This module contains PAL wrapper and helper functions for Support Vector Machine algorithms.

The following classes are available:

class hana_ml.algorithms.pal.svm.OneClassSVM(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, nu=None, scale_info=None, categorical_variable=None, category_weight=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

One Class SVM

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

c : float, optional

Trade-off between training error and margin. Value range: > 0. Defaults to 100.0.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to rbf.

degree : int, optional

Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.

gamma : float, optional

Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.

coef_lin : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

coef_const : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

shrink : bool, optional

If true, use shrink strategy. Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range: >= 0. Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.

nu : float, optional

The value for both the upper bound of the fraction of training errors and the lower bound of the fraction of support vectors. Defaults to 0.5.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.
  • ‘standardization’ : Transforms the data to have zero mean and unit variance.
  • ‘rescale’ : Rescales the range of the features to scale the range in [-1,1].

Defaults to standardization.

categorical_variable : list of str, optional

Column names in the data table used as category variable.

category_weight : float, optional

Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.

Examples

Training data:

>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.1        10.1       100.0          A
2   2         1.2        10.2       100.0          A
3   3         1.3        10.4       100.0          A
4   4         1.2        10.3       100.0         AB
5   5         4.0        40.0       400.0         AB
6   6         4.1        40.1       400.0         AB
7   7         4.2        40.2       400.0         AB
8   8         4.3        40.4       400.0         AB
9   9         4.2        40.3       400.0         AB

Create OneClassSVM instance and call fit:

>>> svc_one = svm.OneClassSVM(conn, scale_info='no', category_weight=1)
>>> svc_one.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                            'ATTRIBUTE4'])
>>> df_predict = conn.table("DATA_TBL_SVC_ONE_PREDICT")
>>> df_predict.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.1        10.1       100.0          A
2   2         1.2        10.2       100.0          A
3   3         1.3        10.4       100.0          A
4   4         1.2        10.3       100.0         AB
5   5         4.0        40.0       400.0         AB
6   6         4.1        40.1       400.0         AB
7   7         4.2        40.2       400.0         AB
8   8         4.3        40.4       400.0         AB
9   9         4.2        40.3       400.0         AB
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...             'ATTRIBUTE4']

Call predict:

>>> svc_one.predict(df_predict, 'ID', features).head(10).collect()
   ID SCORE PROBABILITY
0   0    -1        None
1   1     1        None
2   2     1        None
3   3    -1        None
4   4    -1        None
5   5    -1        None
6   6    -1        None
7   7     1        None
8   8    -1        None
9   9    -1        None
Attributes:
model_ : DataFrame

Model content.

Methods

fit(data[, key, features]) Fit the model when given training dataset and other attributes.
predict(data, key[, features]) Predict the dataset using the trained model.
fit(data, key=None, features=None)

Fit the model when given training dataset and other attributes.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

predict(data, key, features=None)

Predict the dataset using the trained model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Predict result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • Score, type NVARCHAR(100), prediction value.
  • PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.
class hana_ml.algorithms.pal.svm.SVC(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, categorical_variable=None, category_weight=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

Support Vector Classification

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

c : float, optional

Trade-off between training error and margin. Value range: > 0. Defaults to 100.0.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to rbf.

degree : int, optional

Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.

gamma : float, optional

Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.

coef_lin : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

coef_const : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

probability : bool, optional

If true, output probability during prediction. Defaults to False.

shrink : bool, optional

If true, use shrink strategy. Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range: >= 0. Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.
  • ‘standardization’ : Transforms the data to have zero mean and unit variance.
  • ‘rescale’ : Rescales the range of the features to scale the range in [-1,1].

Defaults to standardization.

categorical_variable : list of str, optional

Column names in the data table used as category variable.

category_weight : float, optional

Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.

Examples

Training data:

>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4  LABEL
0   0         1.0        10.0       100.0          A      1
1   1         1.1        10.1       100.0          A      1
2   2         1.2        10.2       100.0          A      1
3   3         1.3        10.4       100.0          A      1
4   4         1.2        10.3       100.0         AB      1
5   5         4.0        40.0       400.0         AB      2
6   6         4.1        40.1       400.0         AB      2
7   7         4.2        40.2       400.0         AB      2
8   8         4.3        40.4       400.0         AB      2
9   9         4.2        40.3       400.0         AB      2

Create SVC instance and call fit:

>>> svc = svm.SVC(connection_context, gamma=0.005)
>>> svc.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                        'ATTRIBUTE3', 'ATTRIBUTE4'])
>>> df_predict = connection_context.table("SVC_PREDICT_DATA_TBL")
>>> df_predict.collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3 ATTRIBUTE4
0   0         1.0        10.0       100.0          A
1   1         1.2        10.2       100.0          A
2   2         4.1        40.1       400.0         AB
3   3         4.2        40.3       400.0         AB
4   4         9.1        90.1       900.0          A
5   5         9.2        90.2       900.0          A
6   6         4.0        40.0       400.0          A

Call predict:

>>> res = svc.predict(df_predict, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                                      'ATTRIBUTE3', 'ATTRIBUTE4'])
>>> res.collect()
   ID SCORE PROBABILITY
0   0     1        None
1   1     1        None
2   2     2        None
3   3     2        None
4   4     3        None
5   5     3        None
6   6     2        None
Attributes:
model_ : DataFrame

Model content.

Methods

fit(data[, key, features, label]) Fit the model when given training dataset and other attributes.
predict(data, key[, features, verbose]) Predict the dataset using the trained model.
score(data, key[, features, label]) Returns the accuracy on the given test data and labels.
fit(data, key=None, features=None, label=None)

Fit the model when given training dataset and other attributes.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(data, key, features=None, verbose=False)

Predict the dataset using the trained model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

verbose : bool, optional

If true, output scoring probabilities for each class. It is only applicable when probability is true during instance creation. Defaults to False.

Returns:
DataFrame
Predict result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • SCORE, type NVARCHAR(100), prediction value.
  • PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.
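
A hedged sketch of probability output: probability must be enabled at construction time and verbose requested at prediction time (reusing df_fit and df_predict from the Examples above; output not shown):

>>> svc_prob = svm.SVC(connection_context, gamma=0.005, probability=True)
>>> svc_prob.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                             'ATTRIBUTE3', 'ATTRIBUTE4'])
>>> svc_prob.predict(df_predict, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                                     'ATTRIBUTE3', 'ATTRIBUTE4'],
...                  verbose=True).collect()
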
score(data, key, features=None, label=None)

Returns the accuracy on the given test data and labels.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns:
accuracy : float

Scalar accuracy value comparing the predicted result and original label.
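
A minimal sketch of score(); the test table name and its LABEL column here are hypothetical:

>>> df_test = connection_context.table("SVC_TEST_DATA_TBL")
>>> svc.score(df_test, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2',
...                           'ATTRIBUTE3', 'ATTRIBUTE4'], label='LABEL')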

class hana_ml.algorithms.pal.svm.SVR(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, scale_label=None, categorical_variable=None, category_weight=None, regression_eps=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

Support Vector Regression

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

c : float, optional

Trade-off between training error and margin. Value range: > 0. Defaults to 100.0.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to rbf.

degree : int, optional

Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.

gamma : float, optional

Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.

coef_lin : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

coef_const : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

shrink : bool, optional

If true, use shrink strategy. Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range: >= 0. Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.
  • ‘standardization’ : Transforms the data to have zero mean and unit variance.
  • ‘rescale’ : Rescales the range of the features to scale the range in [-1,1].

Defaults to standardization.

scale_label : bool, optional

If true, standardize the label for SVR. It is only applicable when the scale_info is standardization. Defaults to True.

categorical_variable : list of str, optional

Column names in the data table used as category variable.

category_weight : float, optional

Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.

regression_eps : float, optional

Epsilon width of tube for regression. Defaults to 0.1.

Examples

Training data:

>>> df_fit.collect()
    ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5       VALUE
0    0    0.788606    0.787308   -1.301485    1.226053   -0.533385   95.626483
1    1    0.414869   -0.381038   -0.719309    1.603499    1.557837  162.582000
2    2    0.236282   -1.118764    0.233341   -0.698410    0.387380  -56.564303
3    3   -0.087779   -0.462372   -0.038412   -0.552897    1.231209  -32.241614
4    4   -0.476389    1.836772   -0.292337   -1.364599    1.326768 -143.240878
5    5    0.523326    0.065154   -1.513822    0.498921   -0.590686   -5.237827
6    6   -1.425838   -0.900437   -0.672299    0.646424    0.508856  -43.005837
7    7   -1.601836    0.455530    0.438217   -0.860707   -0.338282 -126.389824
8    8    0.266698   -0.725057    0.462189    0.868752   -1.542683   46.633594
9    9   -0.772496   -2.192955    0.822904   -1.125882   -0.946846 -175.356260
10  10    0.492364   -0.654237   -0.226986   -0.387156   -0.585063  -49.213910
11  11    0.378409   -1.544976    0.622448   -0.098902    1.437910   34.788276
12  12    0.317183    0.473067   -1.027916    0.549077    0.013483   32.845141
13  13    1.340660   -1.082651    0.730509   -0.944931    0.351025   -6.500411
14  14    0.736456    1.649251    1.334451   -0.530776    0.280830   87.451863

Create SVR instance and call fit:

>>> svr = svm.SVR(conn, kernel='linear', scale_info='standardization',
...               scale_label=True)
>>> svr.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                        'ATTRIBUTE4', 'ATTRIBUTE5'])
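
A prediction call could then follow. This is a sketch only; df_predict is a hypothetical DataFrame with the same attribute columns plus an ID column:

>>> res = svr.predict(df_predict, key='ID',
...                   features=['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3',
...                             'ATTRIBUTE4', 'ATTRIBUTE5'])
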
Attributes:
model_ : DataFrame

Model content.

Methods

fit(data, key[, features, label]) Fit the model when given training dataset and other attributes.
predict(data, key[, features]) Predict the dataset using the trained model.
score(data, key[, features, label]) Returns the coefficient of determination R^2 of the prediction.
fit(data, key, features=None, label=None)

Fit the model when given training dataset and other attributes.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(data, key, features=None)

Predict the dataset using the trained model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.

Returns:
DataFrame
Predict result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • SCORE, type NVARCHAR(100), prediction value.
  • PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.
score(data, key, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

Returns:
accuracy : float

Returns the coefficient of determination R^2 of the prediction.
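
For example, a minimal sketch reusing the training DataFrame from the class-level example above (an independent test set would normally be used instead):

>>> r2 = svr.score(df_fit, key='ID', label='VALUE')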

class hana_ml.algorithms.pal.svm.SVRanking(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, categorical_variable=None, category_weight=None)

Bases: hana_ml.algorithms.pal.svm._SVMBase

Support Vector Ranking

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

c : float, optional

Trade-off between training error and margin. Value range: > 0. Defaults to 100.

kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional

Defaults to rbf.

degree : int, optional

Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.

gamma : float, optional

Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.

coef_lin : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

coef_const : float, optional

Coefficient for the poly/sigmoid kernel type. Defaults to 0.

probability : bool, optional

If true, output probability during prediction. Defaults to False.

shrink : bool, optional

If true, use shrink strategy. Defaults to True.

tol : float, optional

Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.

evaluation_seed : int, optional

The random seed in parameter selection. Value range: >= 0. Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.

scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional

Options:

  • ‘no’ : No scale.
  • ‘standardization’ : Transforms the data to have zero mean and unit variance.
  • ‘rescale’ : Rescales the range of the features to [-1,1].

Defaults to standardization.

categorical_variable : list of str, optional

Column names in the data table used as category variable.

category_weight : float, optional

Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.

Notes

PAL will throw an error if probability=True is provided to the SVRanking constructor and verbose=True is not provided to predict(). This is a known bug.
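
In other words, an instance constructed with probability=True must pass verbose=True to predict(). A minimal sketch (df_train and df_pred are hypothetical DataFrames with the column layout shown in the examples below):

>>> svranking = svm.SVRanking(conn, probability=True)
>>> svranking.fit(df_train, 'ID', 'QID', label='LABEL')
>>> res = svranking.predict(df_pred, key='ID', qid='QID', verbose=True)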

Examples

Training data:

>>> df_fit.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID  LABEL
0   0         1.0         1.0         0.0         0.2         0.0  qid:1      3
1   1         0.0         0.0         1.0         0.1         1.0  qid:1      2
2   2         0.0         0.0         1.0         0.3         0.0  qid:1      1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1      4
4   4         3.0         1.0         1.0         0.4         1.0  qid:1      5
5   5         4.0         1.0         1.0         0.7         0.0  qid:1      6
6   6         0.0         0.0         1.0         0.2         0.0  qid:2      1
7   7         1.0         0.0         1.0         0.4         0.0  qid:2      2
8   8         0.0         0.0         1.0         0.2         0.0  qid:2      1
9   9         1.0         1.0         1.0         0.2         0.0  qid:2      3

Create SVRanking instance and call fit:

>>> svranking = svm.SVRanking(conn, gamma=0.005)
>>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', 'ATTRIBUTE4',
...             'ATTRIBUTE5']
>>> svranking.fit(df_fit, 'ID', 'QID', features, 'LABEL')

Call predict:

>>> df_predict = conn.table("DATA_TBL_SVRANKING_PREDICT")
>>> df_predict.head(10).collect()
   ID  ATTRIBUTE1  ATTRIBUTE2  ATTRIBUTE3  ATTRIBUTE4  ATTRIBUTE5    QID
0   0         1.0         1.0         0.0         0.2         0.0  qid:1
1   1         0.0         0.0         1.0         0.1         1.0  qid:1
2   2         0.0         0.0         1.0         0.3         0.0  qid:1
3   3         2.0         1.0         1.0         0.2         0.0  qid:1
4   4         3.0         1.0         1.0         0.4         1.0  qid:1
5   5         4.0         1.0         1.0         0.7         0.0  qid:1
6   6         0.0         0.0         1.0         0.2         0.0  qid:4
7   7         1.0         0.0         1.0         0.4         0.0  qid:4
8   8         0.0         0.0         1.0         0.2         0.0  qid:4
9   9         1.0         1.0         1.0         0.2         0.0  qid:4
>>> svranking.predict(df_predict, key='ID',
...                   features=features, qid='QID').head(10).collect()
    ID     SCORE PROBABILITY
0    0  -9.85138        None
1    1  -10.8657        None
2    2  -11.6741        None
3    3  -9.33985        None
4    4  -7.88839        None
5    5   -6.8842        None
6    6  -11.7081        None
7    7  -10.8003        None
8    8  -11.7081        None
9    9  -10.2583        None
Attributes:
model_ : DataFrame

Model content.

Methods

fit(data, key, qid[, features, label]) Fit the model when given training dataset and other attributes.
predict(data, key, qid[, features, verbose]) Predict the dataset using the trained model.
fit(data, key, qid, features=None, label=None)

Fit the model when given training dataset and other attributes.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

qid : str

Name of the qid column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label, non-qid columns.

label : str, optional

Name of the label column. If label is not provided, it defaults to the last column.

predict(data, key, qid, features=None, verbose=False)

Predict the dataset using the trained model.

Parameters:
data : DataFrame

DataFrame containing the data.

key : str

Name of the ID column.

qid : str

Name of the qid column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-qid columns.

verbose : bool, optional

If true, output scoring probabilities for each class. Defaults to False.

Returns:
DataFrame
Predict result, structured as follows:
  • ID column, with the same name and type as data’s ID column.
  • SCORE, type NVARCHAR(100), prediction value.
  • PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.

hana_ml.algorithms.pal.trees

This module includes decision tree-based models for classification and regression.

The following classes are available:

class hana_ml.algorithms.pal.trees.DecisionTreeClassifier(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, discretization_type=None, bins=None, max_branch=None, merge_threshold=None, use_surrogate=None, model_format=None, output_rules=True, priors=None, output_confusion_matrix=True)

Bases: hana_ml.algorithms.pal.trees._DecisionTreeBase

Decision Tree model for classification.

Parameters:
conn_context : ConnectionContext

Database connection object.

algorithm : {‘c45’, ‘chaid’, ‘cart’}
Algorithm used to grow a decision tree. Case-insensitive.
  • ‘c45’: C4.5 algorithm.
  • ‘chaid’: Chi-square automatic interaction detection.
  • ‘cart’: Classification and regression tree.
thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
  • False: Not allowed. An error occurs if a missing target is present.
  • True: Allowed. The datum with the missing target is removed.

Defaults to True.

percentage : float, optional

Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning. Defaults to 1.0.

min_records_of_parent : int, optional

Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting. Defaults to 2.

min_records_of_leaf : int, optional

Specifies the minimum number of records in a leaf. Defaults to 1.

max_depth : int, optional

The maximum depth of a tree. By default it is unlimited.

categorical_variable : list of str, optional

Indicates which features should be treated as categorical. The default behavior is: columns of type string are categorical, while columns of type integer or float are continuous. This parameter is valid only for integer variables and is ignored otherwise. By default, the categorical variables are detected from the input data.

split_threshold : float, optional
Specifies the stop condition for a node:
  • C45: The information gain ratio of the best split is less than this value.
  • CHAID: The p-value of the best split is greater than or equal to this value.
  • CART: The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the SPLIT_THRESHOLD value, the larger a C45 or CART tree grows; conversely, CHAID grows a larger tree with a larger SPLIT_THRESHOLD value. Defaults to 1e-5 for C45 and CART, and 0.05 for CHAID.

discretization_type : {‘mdlpc’, ‘equal_freq’}, optional
Strategy for discretizing continuous attributes. Case-insensitive.
  • ‘mdlpc’: Minimum description length principle criterion.
  • ‘equal_freq’: Equal frequency discretization.

Valid only for C45 and CHAID. Defaults to mdlpc.

bins : List of tuples: (column name, number of bins), optional

Specifies the number of bins for discretization. Only valid when discretization_type is equal_freq. Defaults to 10 for each column. See the sketch after this parameter list.

max_branch : int, optional

Specifies the maximum number of branches. Valid only for CHAID. Defaults to 10.

merge_threshold : float, optional

Specifies the merge condition for CHAID: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches. Only valid for CHAID. Defaults to 0.05.

use_surrogate : bool, optional

If true, use surrogate split when NULL values are encountered. Only valid for CART. Defaults to True.

model_format : {‘json’, ‘pmml’}, optional
Specifies the format in which the tree model is stored. Case-insensitive.
  • ‘json’: export model in json format.
  • ‘pmml’: export model in pmml format.

Defaults to json.

output_rules : bool, optional

If true, output decision rules. Defaults to True.

priors : List of tuples: (class, prior_prob), optional

Specifies the prior probability of every class label. Defaults to values detected from the data. See the sketch after this parameter list.

output_confusion_matrix : bool, optional

If true, output the confusion matrix. Defaults to True.
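
As referenced in the bins and priors descriptions above, both parameters are supplied as lists of tuples. The following sketch is illustrative only; the column names and class labels are hypothetical and merely mirror the example below:

>>> dtc = DecisionTreeClassifier(conn_context=cc, algorithm='chaid',
...                              discretization_type='equal_freq',
...                              bins=[('TEMP', 5), ('HUMIDITY', 4)],
...                              priors=[('Play', 0.7), ('Do not Play', 0.3)])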

Examples

Input dataframe for training:

>>> df1.head(4).collect()
   OUTLOOK  TEMP  HUMIDITY WINDY        CLASS
0    Sunny    75      70.0   Yes         Play
1    Sunny    80      90.0   Yes  Do not Play
2    Sunny    85      85.0    No  Do not Play
3    Sunny    72      95.0    No  Do not Play

Creating DecisionTreeClassifier instance:

>>> dtc = DecisionTreeClassifier(conn_context=cc, algorithm='c45',
...                              min_records_of_parent=2,
...                              min_records_of_leaf=1,
...                              thread_ratio=0.4, split_threshold=1e-5,
...                              model_format='json', output_rules=True)

Performing fit() on given dataframe:

>>> dtc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='CLASS')
>>> dtc.decision_rules_.collect()
   ROW_INDEX                                                  RULES_CONTENT
0         0                                       (TEMP>=84) => Do not Play
1         1                         (TEMP<84) && (OUTLOOK=Overcast) => Play
2         2         (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play
3         3 (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play
4         4       (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play
5         5               (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play

Input dataframe for predicting:

>>> df2.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY
0   0  Overcast      75.0    70   Yes
1   1      Rain      78.0    70   Yes
2   2     Sunny      66.0    70   Yes
3   3     Sunny      69.0    70   Yes
4   4      Rain       NaN    70   Yes
5   5      None      70.0    70   Yes
6   6       ***      70.0    70   Yes

Performing predict() on given dataframe:

>>> result = dtc.predict(df2, key='ID', verbose=False)
>>> result.collect()
   ID        SCORE  CONFIDENCE
0   0         Play    1.000000
1   1  Do not Play    1.000000
2   2         Play    1.000000
3   3         Play    1.000000
4   4  Do not Play    1.000000
5   5         Play    0.692308
6   6         Play    0.692308

Input dataframe for scoring:

>>> df3.collect()
   ID   OUTLOOK  HUMIDITY  TEMP WINDY        LABEL
0   0  Overcast      75.0    70   Yes         Play
1   1      Rain      78.0    70    No  Do not Play
2   2     Sunny      66.0    70   Yes         Play
3   3     Sunny      69.0    70   Yes         Play

Performing score() on given dataframe:

>>> dtc.score(df3, key='ID')
0.75
Attributes:
model_ : DataFrame

Trained model content.

decision_rules_ : DataFrame

Rules for decision tree to make decisions. Set to None if output_rules is False.

confusion_matrix_ : DataFrame

Confusion matrix used to evaluate the performance of classification algorithms. Set to None if output_confusion_matrix is False.

Methods

fit(data[, key, features, label]) Function for building a decision tree classifier.
predict(data, key[, features, verbose]) Predict dependent variable values based on fitted model.
score(data, key[, features, label]) Returns the mean accuracy on the given test data and labels.
fit(data, key=None, features=None, label=None)

Function for building a decision tree classifier.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

predict(data, key, features=None, verbose=False)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If true, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification. Defaults to False.

Returns:
DataFrame
DataFrame of score and confidence, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • SCORE, type DOUBLE, representing the predicted classes/values.
  • CONFIDENCE, type DOUBLE, representing the confidence of a class; all 0s for regression.
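For example, a call with verbose=True (a sketch reusing dtc and df2 from the class-level example) returns the confidence of every class for each data point rather than only the top-scoring one:

>>> result_all = dtc.predict(df2, key='ID', verbose=True)
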
score(data, key, features=None, label=None)

Returns the mean accuracy on the given test data and labels.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

Returns:
float

Mean accuracy on the given test data and labels.

class hana_ml.algorithms.pal.trees.DecisionTreeRegressor(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, use_surrogate=None, model_format=None, output_rules=True)

Bases: hana_ml.algorithms.pal.trees._DecisionTreeBase

Decision Tree model for regression.

Parameters:
conn_context : ConnectionContext

Database connection object.

algorithm : {‘cart’}
Algorithm used to grow a decision tree.
  • ‘cart’: Classification and Regression tree.

Currently supports cart.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.

allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
  • False: Not allowed. An error occurs if a missing target is present.
  • True: Allowed. The datum with the missing target is removed.

Defaults to True.

percentage : float, optional

Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning. Defaults to 1.0.

min_records_of_parent : int, optional

Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting. Defaults to 2.

min_records_of_leaf : int, optional

Specifies the minimum number of records in a leaf. Defaults to 1.

max_depth : int, optional

The maximum depth of a tree. By default it is unlimited.

categorical_variable : list of str, optional

Indicates which features should be treated as categorical. The default behavior is: columns of type string are categorical, while columns of type integer or float are continuous. This parameter is valid only for integer variables and is ignored otherwise. By default, the categorical variables are detected from the input data.

split_threshold : float, optional
Specifies the stop condition for a node:
  • CART: The reduction of Gini index or relative MSE of the best split is less than this value.

The smaller the SPLIT_THRESHOLD value is, the larger a CART tree grows. Defaults to 1e-5 for CART.

use_surrogate : bool, optional

If true, use surrogate split when NULL values are encountered. Only valid for cart. Defaults to True.

model_format : {‘json’, ‘pmml’}, optional
Specifies the format in which the tree model is stored. Case-insensitive.
  • ‘json’: export model in json format.
  • ‘pmml’: export model in pmml format.

Defaults to json.

output_rules : bool, optional

If true, output decision rules. Defaults to True.

Examples

Input dataframe for training:

>>> df1.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000

Creating DecisionTreeRegressor instance:

>>> dtr = DecisionTreeRegressor(conn_context=cc, algorithm='cart',
...                             min_records_of_parent=2, min_records_of_leaf=1,
...                             thread_ratio=0.4, split_threshold=1e-5,
...                             model_format='pmml', output_rules=True)

Performing fit() on given dataframe:

>>> dtr.fit(df1, key='ID')
>>> dtr.decision_rules_.head(2).collect()
   ROW_INDEX                                      RULES_CONTENT
0          0         (A<-0.495502) && (B<-0.663588) => -85.8762
1          1        (A<-0.495502) && (B>=-0.663588) => -29.9827

Input dataframe for predicting:

>>> df2.collect()
   ID         A         B         C         D
0   0  1.764052  0.400157  0.978738  2.240893
1   1  1.867558 -0.977278  0.950088 -0.151357
2   2 -0.103219  0.410598  0.144044  1.454274
3   3  0.761038  0.121675  0.443863  0.333674
4   4  1.494079 -0.205158  0.313068 -0.854096

Performing predict() on given dataframe:

>>> result = dtr.predict(df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0  49.8229         0.0
1   1  4.87728         0.0
2   2  11.9148         0.0
3   3   19.753         0.0
4   4   23.607         0.0

Input dataframe for scoring:

>>> df3.collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000

Performing score() on given dataframe:

>>> dtr.score(df3, key='ID')
0.9999999999900131
Attributes:
model_ : DataFrame

Trained model content.

decision_rules_ : DataFrame

Rules for decision tree to make decisions. Set to None if output_rules is False.

Methods

fit(data[, key, features, label]) Train the model on input data.
predict(data, key[, features, verbose]) Predict dependent variable values based on fitted model.
score(data, key[, features, label]) Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)

Train the model on input data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

predict(data, key, features=None, verbose=False)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

verbose : bool, optional

If true, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification. Defaults to False.

Returns:
DataFrame
DataFrame of score and confidence, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • SCORE, type DOUBLE, representing the predicted classes/values.
  • CONFIDENCE, type DOUBLE, representing the confidence of a class; all 0s for regression.
score(data, key, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

Returns:
float

The coefficient of determination R^2 of the prediction on the given data.

class hana_ml.algorithms.pal.trees.RandomForestClassifier(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=1, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None, strata=None, priors=None)

Bases: hana_ml.algorithms.pal.trees._RandomForestBase

Random forest model for classification.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in the random forest. Defaults to 100.

max_features : int, optional

Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to sqrt(p) for classification or p/3 for regression, where p is the number of input features.

max_depth : int, optional

The maximum depth of a tree. By default it is unlimited.

min_samples_leaf : int, optional

Specifies the minimum number of records in a leaf. Defaults to 1 for classification.

split_threshold : float, optional

Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing. Defaults to 1e-5.

calculate_oob : bool, optional

If true, calculate the out-of-bag error. Defaults to True.

random_state : int, optional

Specifies the seed for random number generator. 0: Uses the current time (in seconds) as the seed. Others: Uses the specified value as the seed. Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to heuristically determined.

allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
  • False: Not allowed. An error occurs if a missing target is present.
  • True: Allowed. The datum with the missing target is removed.

Defaults to True.

categorical_variable : list of str, optional

Indicates which features should be treated as categorical. The default behavior is: columns of type string are categorical, while columns of type integer or float are continuous. This parameter is valid only for integer variables and is ignored otherwise. By default, the categorical variables are detected from the input data.

sample_fraction : float, optional

The fraction of data used for training. If there are n rows of data and the sample fraction is r, then n*r rows are selected for training. Defaults to 1.0.

strata : List of tuples: (class, fraction), optional

Strata proportions for stratified sampling. A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample. If the given fractions do not add up to 1, the remaining portion is divided equally between classes with no entry in strata, or between all classes if all classes have an entry in strata. If strata is not provided, bagging is used instead of stratified sampling.

priors : List of tuples: (class, prior_prob), optional

Prior probabilities for classes. A (class, prior_prob) tuple specifies the prior probability of this class. If the given priors do not add up to 1, the remaining portion is divided equally between classes with no entry in priors, or between all classes if all classes have an entry in ‘priors’. If priors is not provided, it is determined by the proportion of every class in the training data.
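
Both strata and priors are supplied as lists of (class, value) tuples. The following sketch is for illustration only, using the class labels from the example below:

>>> rfc = RandomForestClassifier(conn_context=cc, n_estimators=10,
...                              strata=[('Play', 0.7), ('Do not Play', 0.3)],
...                              priors=[('Play', 0.6), ('Do not Play', 0.4)])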

Examples

Input dataframe for training:

>>> df1.head(4).collect()
   OUTLOOK  TEMP  HUMIDITY WINDY       LABEL
0    Sunny  75.0      70.0   Yes        Play
1    Sunny   NaN      90.0   Yes Do not Play
2    Sunny  85.0       NaN    No Do not Play
3    Sunny  72.0      95.0    No Do not Play

Creating RandomForestClassifier instance:

>>> rfc = RandomForestClassifier(conn_context=cc, n_estimators=3,
...                              max_features=3, random_state=2,
...                              split_threshold=0.00001,
...                              calculate_oob=True,
...                              min_samples_leaf=1, thread_ratio=1.0)

Performing fit() on given dataframe:

>>> rfc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='LABEL')
>>> rfc.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0       OUTLOOK    0.449550
1          TEMP    0.216216
2      HUMIDITY    0.208108
3         WINDY    0.126126

Input dataframe for predicting:

>>> df2.collect()
   ID   OUTLOOK     TEMP  HUMIDITY WINDY
0   0  Overcast     75.0  -10000.0   Yes
1   1      Rain     78.0      70.0   Yes

Performing predict() on given dataframe:

>>> result = rfc.predict(df2, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   0  Play    0.666667
1   1  Play    0.666667

Input dataframe for scoring:

>>> df3.collect()
   ID   OUTLOOK  TEMP  HUMIDITY WINDY LABEL
0   0     Sunny    70      90.0   Yes  Play
1   1  Overcast    81      90.0   Yes  Play
2   2      Rain    65      80.0    No  Play

Performing score() on given dataframe:

>>> rfc.score(df3, key='ID')
0.6666666666666666
Attributes:
model_ : DataFrame

Trained model content.

feature_importances_ : DataFrame

The feature importance (the higher, the more important the feature).

oob_error_ : DataFrame

Out-of-bag error rate or mean squared error for random forest up to indexed tree. Set to None if calculate_oob is False.

confusion_matrix_ : DataFrame

Confusion matrix used to evaluate the performance of classification algorithms.

Methods

fit(data[, key, features, label]) Train the model on input data.
predict(data, key[, features, verbose, …]) Predict dependent variable values based on fitted model.
score(data, key[, features, label, …]) Returns the mean accuracy on the given test data and labels.
fit(data, key=None, features=None, label=None)

Train the model on input data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

predict(data, key, features=None, verbose=None, block_size=None, missing_replacement=None)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates loading all data at once. Defaults to 0.

missing_replacement : str, optional
The missing replacement strategy:
  • ‘feature_marginalized’: marginalise each missing feature out independently.
  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to feature_marginalized.

verbose : bool, optional

If true, output all classes and the corresponding confidences for each data point.

Returns:
DataFrame
DataFrame of score and confidence, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • SCORE, type DOUBLE, representing the predicted classes.
  • CONFIDENCE, type DOUBLE, representing the confidence of a class.
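For example, a sketch reusing rfc and df2 from the class-level example (the block_size and missing_replacement values are purely illustrative):

>>> result = rfc.predict(df2, key='ID', verbose=False,
...                      block_size=1000,
...                      missing_replacement='instance_marginalized')
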
score(data, key, features=None, label=None, block_size=None, missing_replacement=None)

Returns the mean accuracy on the given test data and labels.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates loading all data at once. Defaults to 0.

missing_replacement : str, optional
The missing replacement strategy:
  • ‘feature_marginalized’: marginalise each missing feature out independently.
  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to feature_marginalized.

Returns:
float

Mean accuracy on the given test data and labels.

class hana_ml.algorithms.pal.trees.RandomForestRegressor(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=None, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None)

Bases: hana_ml.algorithms.pal.trees._RandomForestBase

Random forest model for regression.

Parameters:
conn_context : ConnectionContext

Connection to the HANA system.

n_estimators : int, optional

Specifies the number of trees in the random forest. Defaults to 100.

max_features : int, optional

Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to sqrt(p) for classification or p/3 for regression, where p is the number of input features.

max_depth : int, optional

The maximum depth of a tree. By default it is unlimited.

min_samples_leaf : int, optional

Specifies the minimum number of records in a leaf. Defaults to 5 for regression.

split_threshold : float, optional

Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing. Defaults to 1e-5.

calculate_oob : bool, optional

If true, calculate the out-of-bag error. Defaults to True.

random_state : int, optional

Specifies the seed for random number generator. 0: Uses the current time (in seconds) as the seed. Others: Uses the specified value as the seed. Defaults to 0.

thread_ratio : float, optional

Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to heuristically determined.

allow_missing_dependent : bool, optional
Specifies if a missing target value is allowed.
  • False: Not allowed. An error occurs if a missing target is present.
  • True: Allowed. The datum with a missing target is removed.

Defaults to True.

categorical_variable : list of str, optional

Indicates which features should be treated as categorical. The default behavior is: columns of type string are categorical, while columns of type integer or float are continuous. This parameter is valid only for integer variables and is ignored otherwise. By default, the categorical variables are detected from the input data.

sample_fraction : float, optional

The fraction of data used for training. If there are n rows of data and the sample fraction is r, then n*r rows are selected for training. Defaults to 1.0.

Examples

Input dataframe for training:

>>> df1.head(5).collect()
   ID         A         B         C         D       CLASS
0   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
1   1  2.249528  1.459918  0.153440 -0.526423  212.532559
2   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
3   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
4   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935

Creating RandomForestRegressor instance:

>>> rfr = RandomForestRegressor(conn_context=cc, random_state=3)

Performing fit() on given dataframe:

>>> rfr.fit(df1, key='ID')
>>> rfr.feature_importances_.collect()
   VARIABLE_NAME  IMPORTANCE
0             A    0.249593
1             B    0.381879
2             C    0.291403
3             D    0.077125

Input dataframe for predicting:

>>> df2.collect()
   ID         A         B         C         D
0   0  1.081277  0.204114  1.220580 -0.750665
1   1  0.524813 -0.012192 -0.418597  2.946886

Performing predict() on given dataframe:

>>> result = rfr.predict(df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0    48.126   62.952884
1   1  -10.9017   73.461039

Input dataframe for scoring:

>>> df3.head(5).collect()
    ID         A         B         C         D       CLASS
0    0  1.081277  0.204114  1.220580 -0.750665   139.10170
1    1  0.524813 -0.012192 -0.418597  2.946886    52.17203
2    2 -0.280871  0.100554 -0.343715 -0.118843   -34.69829
3    3 -0.113992 -0.045573  0.957154  0.090350    51.93602
4    4  0.287476  1.266895  0.466325 -0.432323   106.63425

Performing score() on given dataframe:

>>> rfr.score(df3, key='ID')
0.6530768858159514
Attributes:
model_ : DataFrame

Trained model content.

feature_importances_ : DataFrame

The feature importance (the higher, the more important the feature).

oob_error_ : DataFrame

Out-of-bag error rate or mean squared error for random forest up to indexed tree. Set to None if calculate_oob is False.

Methods

fit(data[, key, features, label]) Train the model on input data.
predict(data, key[, features, block_size, …]) Predict dependent variable values based on fitted model.
score(data, key[, features, label, …]) Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)

Train the model on input data.

Parameters:
data : DataFrame

Training data.

key : str, optional

Name of the ID column. If key is not provided, it is assumed that the input has no ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

predict(data, key, features=None, block_size=None, missing_replacement=None)

Predict dependent variable values based on fitted model.

Parameters:
data : DataFrame

Independent variable values to predict for.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID columns.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates loading all data at once. Defaults to 0.

missing_replacement : str, optional
The missing replacement strategy:
  • ‘feature_marginalized’: marginalise each missing feature out independently.
  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to feature_marginalized.

Returns:
DataFrame
DataFrame of score and confidence, structured as follows:
  • ID column, with same name and type as data’s ID column.
  • SCORE, type DOUBLE, representing the predicted values.
  • CONFIDENCE, all 0s; included because PAL uses the same result table structure as for classification.
score(data, key, features=None, label=None, block_size=None, missing_replacement=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters:
data : DataFrame

Data on which to assess model performance.

key : str

Name of the ID column.

features : list of str, optional

Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.

label : str, optional

Name of the dependent variable. Defaults to the last column.

block_size : int, optional

The number of rows loaded at a time during prediction. 0 indicates loading all data at once. Defaults to 0.

missing_replacement : str, optional
The missing replacement strategy:
  • ‘feature_marginalized’: marginalise each missing feature out independently.
  • ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.

Defaults to feature_marginalized.

Returns:
float

The coefficient of determination R^2 of the prediction on the given data.
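
For example, a sketch reusing rfr and df3 from the class-level example (the block_size and missing_replacement values are purely illustrative):

>>> r2 = rfr.score(df3, key='ID', block_size=500,
...                missing_replacement='feature_marginalized')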