hana_ml.algorithms.pal package¶
The Algorithms PAL Package consists of the following sections:
- hana_ml.algorithms.pal.clustering
- hana_ml.algorithms.pal.decomposition
- hana_ml.algorithms.pal.linear_model
- hana_ml.algorithms.pal.metrics
- hana_ml.algorithms.pal.mixture
- hana_ml.algorithms.pal.naive_bayes
- hana_ml.algorithms.pal.neighbors
- hana_ml.algorithms.pal.neural_network
- hana_ml.algorithms.pal.preprocessing
- hana_ml.algorithms.pal.regression
- hana_ml.algorithms.pal.stats
- hana_ml.algorithms.pal.svm
- hana_ml.algorithms.pal.trees
hana_ml.algorithms.pal.clustering¶
This module contains PAL wrapper and helper functions for clustering algorithms. The following classes are available:
-
class hana_ml.algorithms.pal.clustering.DBSCAN(conn_context, minpts=None, eps=None, thread_ratio=None, metric=None, minkowski_power=None, categorical_variable=None, category_weights=None, algorithm=None, save_model=True)
Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based data clustering algorithm that finds a number of clusters starting from the estimated density distribution of corresponding nodes.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- minpts : int, optional
The minimum number of points required to form a cluster.
- eps : float, optional
The scan radius.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to heuristically determined.
- metric : str, optional
Ways to compute the distance between two points.
- ‘manhattan’
- ‘euclidean’
- ‘minkowski’
- ‘chebyshev’
- ‘standardized_euclidean’
- ‘cosine’
Defaults to euclidean.
- minkowski_power : int, optional
When Minkowski distance is used, this parameter controls the value of the power. Only applicable when metric is 'minkowski'. Defaults to 3.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- category_weights : float, optional
Represents the weight of category attributes. Defaults to 0.707.
- algorithm : str, optional
Ways to search for neighbours.
- ‘brute-force’
- ‘kd-tree’
Defaults to kd-tree.
- save_model : bool, optional
If true, the generated model will be saved. save_model must be True to call predict(). Defaults to True.
Examples
Input dataframe for clustering:
>>> df.collect()
    ID     V1     V2 V3
0    1   0.10   0.10  B
1    2   0.11   0.10  A
2    3   0.10   0.11  C
3    4   0.11   0.11  B
4    5   0.12   0.11  A
5    6   0.11   0.12  E
6    7   0.12   0.12  A
7    8   0.12   0.13  C
8    9   0.13   0.12  D
9   10   0.13   0.13  D
10  11   0.13   0.14  A
11  12   0.14   0.13  C
12  13  10.10  10.10  A
13  14  10.11  10.10  F
14  15  10.10  10.11  E
15  16  10.11  10.11  E
16  17  10.11  10.12  A
17  18  10.12  10.11  B
18  19  10.12  10.12  B
19  20  10.12  10.13  D
20  21  10.13  10.12  F
21  22  10.13  10.13  A
22  23  10.13  10.14  A
23  24  10.14  10.13  D
24  25   4.10   4.10  A
25  26   7.11   7.10  C
26  27  -3.10  -3.11  C
27  28  16.11  16.11  A
28  29  20.11  20.12  C
29  30  15.12  15.11  A
Create a DBSCAN instance:
>>> dbscan = DBSCAN(conn_context=cc, thread_ratio=0.2, metric='manhattan')
Perform fit on the given data:
>>> dbscan.fit(df, 'ID')
Expected output:
>>> dbscan.labels_.collect()
    ID  CLUSTER_ID
0    1           0
1    2           0
2    3           0
3    4           0
4    5           0
5    6           0
6    7           0
7    8           0
8    9           0
9   10           0
10  11           0
11  12           0
12  13           1
13  14           1
14  15           1
15  16           1
16  17           1
17  18           1
18  19           1
19  20           1
20  21           1
21  22           1
22  23           1
23  24           1
24  25          -1
25  26          -1
26  27          -1
27  28          -1
28  29          -1
29  30          -1
Attributes: - labels_ : DataFrame
Label assigned to each sample.
- model_ : DataFrame
Model content. Set to None if save_model is False.
Methods
- fit(data, key[, features]): Fit the DBSCAN model when given the training dataset.
- fit_predict(data, key[, features]): Fit with the dataset and return the labels.
- predict(data, key[, features]): Assign clusters to data based on a fitted model.
-
fit(data, key, features=None)
Fit the DBSCAN model when given the training dataset.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the features columns. If features is not provided, it defaults to all the non-ID columns.
-
fit_predict(data, key, features=None)
Fit with the dataset and return the labels.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the features columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Fit result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- CLUSTER_ID, type INTEGER, cluster ID assigned to the data point. (Cluster IDs range from 0 to 1 less than the number of clusters. A cluster ID of -1 means the point is labeled as noise.)
-
predict(data, key, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
Parameters: - data : DataFrame
Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().
- key : str
Name of the ID column.
- features : list of str, optional.
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Cluster assignment results, with 3 columns:
- Data point ID, with name and type taken from the input ID column.
- CLUSTER_ID, type INTEGER, representing the cluster the data point is assigned to.
- DISTANCE, type DOUBLE, representing the distance between the data point and the nearest core point.
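For illustration, a minimal sketch of scoring new points with a fitted model, assuming the dbscan instance from the example above (fitted with the default save_model=True) and a hypothetical table holding new points with the same ID, V1, V2, V3 structure:
>>> df_new = cc.table('DBSCAN_PREDICT_DATA_TBL')   # hypothetical table with the same columns as df
>>> assignments = dbscan.predict(df_new, key='ID')
>>> assignments.collect()                          # returns ID, CLUSTER_ID and DISTANCE per input point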
-
class hana_ml.algorithms.pal.clustering.KMeans(conn_context, n_clusters=None, n_clusters_min=None, n_clusters_max=None, init=None, max_iter=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None, tol=None, memory_mode=None, accelerated=False)
Bases: hana_ml.algorithms.pal.pal_base.PALBase, hana_ml.algorithms.pal.clustering._ClusterAssignmentMixin
K-Means model that handles clustering problems.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- n_clusters : int, optional
Number of clusters. If this parameter is not specified, you must specify the minimum and maximum range parameters instead.
- n_clusters_min : int, optional
Cluster range minimum.
- n_clusters_max : int, optional
Cluster range maximum.
- init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional
Controls how the initial centers are selected:
- ‘first_k’: First k observations.
- ‘replace’: Random with replacement.
- ‘no_replace’: Random without replacement.
- ‘patent’: Patented method of selecting the initial centers (US 6,882,998 B1).
Defaults to patent.
- max_iter : int, optional
Max iterations. Defaults to 100.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
- distance_level : str, optional
Ways to compute the distance between the item and the cluster center.
- ‘manhattan’
- ‘euclidean’
- ‘minkowski’
- ‘chebyshev’
- ‘cosine’
Defaults to euclidean. ‘cosine’ is only valid when accelerated is False.
- minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski. Defaults to 3.0.
- category_weights : float, optional
Represents the weight of category attributes. Defaults to 0.707.
- normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional
Normalization type.
- ‘no’: No normalization will be applied.
- ‘l1_norm’: Yes, for each point X (x1,x2,…,xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+...|xn|.
- ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).
Defaults to no.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- tol : float, optional
Convergence threshold for exiting iterations. Only valid when accelerated is False. Defaults to 1.0e-6.
- memory_mode : {‘auto’, ‘optimize-speed’, ‘optimize-space’}, optional
Indicates the memory mode that the algorithm uses.
- ‘auto’: Chosen by algorithm.
- ‘optimize-speed’: Prioritizes speed.
- ‘optimize-space’: Prioritizes memory.
Only valid when accelerated is True. Defaults to auto.
- accelerated : bool, optional
Indicates whether to use technology like cache to accelerate the calculation process. If True, the calculation process will be accelerated. If False, the calculation process will not be accelerated. Defaults to False.
Examples
Input dataframe for clustering:
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6
Create KMeans instance:
>>> km = clustering.KMeans(conn_context=cc, n_clusters=4, init='first_k',
...                        max_iter=100, tol=1.0E-6, thread_ratio=0.2,
...                        distance_level='Euclidean',
...                        category_weights=0.5)
Perform fit_predict:
>>> labels = km.fit_predict(df, 'ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  0.891088           0.944370
1    1           0  0.863917           0.942478
2    2           0  0.806252           0.946288
3    3           0  0.835684           0.944942
4    4           0  0.744571           0.950234
5    5           3  0.891088           0.940733
6    6           3  0.835684           0.944412
7    7           3  0.806252           0.946519
8    8           3  0.863917           0.946121
9    9           3  0.744571           0.949899
10  10           2  0.825527           0.945092
11  11           2  0.933886           0.937902
12  12           2  0.881692           0.945008
13  13           2  0.764318           0.949160
14  14           2  0.923456           0.939283
15  15           1  0.901684           0.940436
16  16           1  0.976885           0.939386
17  17           1  0.818178           0.945878
18  18           1  0.722799           0.952170
19  19           1  1.102342           0.925679
>>> df = cc.table("PAL_ACCKMEANS_DATA_TBL")
>>> df.collect()
    ID  V000 V001  V002
0    0   0.5    A     0
1    1   1.5    A     0
2    2   1.5    A     1
3    3   0.5    A     1
4    4   1.1    B     1
5    5   0.5    B    15
6    6   1.5    B    15
7    7   1.5    B    16
8    8   0.5    B    16
9    9   1.2    C    16
10  10  15.5    C    15
11  11  16.5    C    15
12  12  16.5    C    16
13  13  15.5    C    16
14  14  15.6    D    16
15  15  15.5    D     0
16  16  16.5    D     0
17  17  16.5    D     1
18  18  15.5    D     1
19  19  15.7    A     1
Create an accelerated KMeans instance:
>>> akm = clustering.KMeans(conn_context=cc, init='first_k',
...                         thread_ratio=0.5, n_clusters=4,
...                         distance_level='euclidean',
...                         max_iter=100, category_weights=0.5,
...                         categorical_variable=['V002'],
...                         accelerated=True)
Perform fit_predict:
>>> labels = akm.fit_predict(df, 'ID')
>>> labels.collect()
    ID  CLUSTER_ID  DISTANCE  SLIGHT_SILHOUETTE
0    0           0  1.198938           0.006767
1    1           0  1.123938           0.068899
2    2           3  0.500000           0.572506
3    3           3  0.500000           0.598267
4    4           0  0.621517           0.229945
5    5           0  1.037500           0.308333
6    6           0  0.962500           0.358333
7    7           0  0.895513           0.402992
8    8           0  0.970513           0.352992
9    9           0  0.823938           0.313385
10  10           1  1.038276           0.931555
11  11           1  1.178276           0.927130
12  12           1  1.135685           0.929565
13  13           1  0.995685           0.934165
14  14           1  0.849615           0.944359
15  15           1  0.995685           0.934548
16  16           1  1.135685           0.929950
17  17           1  1.089615           0.932769
18  18           1  0.949615           0.937555
19  19           1  0.915565           0.937717
Attributes: - labels_ : DataFrame
Label assigned to each sample.
- cluster_centers_ : DataFrame
Coordinates of cluster centers.
- model_ : DataFrame
Model content.
- statistics_ : DataFrame
Statistic value.
Methods
- fit(data, key[, features]): Fit the model when given the training dataset.
- fit_predict(data, key[, features]): Fit with the dataset and return the labels.
- predict(data, key[, features]): Assign clusters to data based on a fitted model.
-
fit(data, key, features=None)
Fit the model when given the training dataset.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
-
fit_predict(data, key, features=None)
Fit with the dataset and return the labels.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Fit result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
- DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
- SLIGHT_SILHOUETTE, type DOUBLE, estimated value (slight silhouette).
-
predict(data, key, features=None)
Assign clusters to data based on a fitted model.
The output structure of this method does not match that of fit_predict().
Parameters: - data : DataFrame
Data points to match against computed clusters. This dataframe’s column structure should match that of the data used for fit().
- key : str
Name of the ID column.
- features : list of str, optional.
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Cluster assignment results, with 3 columns:
- Data point ID, with name and type taken from the input ID column.
- CLUSTER_ID, INTEGER type, representing the cluster the data point is assigned to.
- DISTANCE, DOUBLE type, representing the distance between the data point and the cluster center.
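A minimal sketch of assigning new observations with a fitted model, assuming the km instance from the example above and a hypothetical table with the same column layout as df:
>>> new_df = cc.table('PAL_KMEANS_PREDICT_TBL')    # hypothetical table: ID, V000, V001, V002
>>> assigned = km.predict(new_df, key='ID', features=['V000', 'V001', 'V002'])
>>> assigned.collect()                             # one row per ID: ID, CLUSTER_ID, DISTANCE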
-
class hana_ml.algorithms.pal.clustering.KMedians(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)
Bases: hana_ml.algorithms.pal.clustering._KClusteringBase
K-Medians clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medians of each feature to calculate cluster centers.
Parameters: - conn_context : ConnectionContext
Database connection object.
- n_clusters : int
Number of groups.
- init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional
Controls how the initial centers are selected:
- ‘first_k’: First k observations.
- ‘replace’: Random with replacement.
- ‘no_replace’: Random without replacement.
- ‘patent’: Patented method of selecting the initial centers (US 6,882,998 B1).
Defaults to patent.
- max_iter : int, optional
Max iterations. Defaults to 100.
- tol : float, optional
Convergence threshold for exiting iterations. Defaults to 1.0e-6.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
- distance_level : str, optional
Ways to compute the distance between the item and the cluster center.
- ‘manhattan’
- ‘euclidean’
- ‘minkowski’
- ‘chebyshev’
- ‘cosine’
Defaults to euclidean.
- minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski. Defaults to 3.0.
- category_weights : float, optional
Represents the weight of category attributes. Defaults to 0.707.
- normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional
Normalization type.
- ‘no’: No, normalization will not be applied.
- ‘l1_norm’: Yes, for each point X (x1,x2,…,xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+...|xn|.
- ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).
Defaults to no.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
Examples
Input dataframe for clustering:
>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6
Creating KMedians instance:
>>> kmedians = KMedians(conn_context=cc, n_clusters=4, init='first_k',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)
Performing fit() on given dataframe:
>>> kmedians.fit(df1, 'ID')
>>> kmedians.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.1    A   1.2
1           1  15.7    D   1.5
2           2  15.6    C  16.2
3           3   1.2    B  16.1
Performing fit_predict() on given dataframe:
>>> kmedians.fit_predict(df1, 'ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  0.921954
1    1           0  0.806226
2    2           0  0.500000
3    3           0  0.670820
4    4           0  0.707107
5    5           3  0.921954
6    6           3  0.670820
7    7           3  0.500000
8    8           3  0.806226
9    9           3  0.707107
10  10           2  0.707107
11  11           2  1.140175
12  12           2  0.948683
13  13           2  0.316228
14  14           2  0.707107
15  15           1  1.019804
16  16           1  1.280625
17  17           1  0.800000
18  18           1  0.200000
19  19           1  0.807107
Attributes: - cluster_centers_ : DataFrame
Coordinates of cluster centers.
- labels_ : DataFrame
Cluster assignment and distance to cluster center for each point.
Methods
- fit(data, key[, features]): Perform clustering on input dataset.
- fit_predict(data, key[, features]): Perform clustering algorithm and return labels.
-
fit(data, key, features=None)
Perform clustering on input dataset.
Parameters: - data : DataFrame
DataFrame contains input data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
-
fit_predict(data, key, features=None)
Perform clustering algorithm and return labels.
Parameters: - data : DataFrame
DataFrame containing input data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Fit result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
- DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
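Because the returned labels carry only the ID, a common follow-up is to join them back to the input for inspection. A client-side sketch using pandas, assuming the kmedians instance and df1 from the example above:
>>> labels = kmedians.fit_predict(df1, 'ID').collect()   # pandas DataFrame: ID, CLUSTER_ID, DISTANCE
>>> points = df1.collect()                               # pandas DataFrame: ID, V000, V001, V002
>>> labelled = points.merge(labels, on='ID')             # original columns plus the cluster assignment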
-
class hana_ml.algorithms.pal.clustering.KMedoids(conn_context, n_clusters, init=None, max_iter=None, tol=None, thread_ratio=None, distance_level=None, minkowski_power=None, category_weights=None, normalization=None, categorical_variable=None)
Bases: hana_ml.algorithms.pal.clustering._KClusteringBase
K-Medoids clustering algorithm that partitions n observations into K clusters according to their nearest cluster center. It uses medoids to calculate cluster centers. K-Medoids is more robust to noise and outliers.
Parameters: - conn_context : ConnectionContext
Database connection object.
- n_clusters : int
Number of groups.
- init : {‘first_k’, ‘replace’, ‘no_replace’, ‘patent’}, optional
Controls how the initial centers are selected:
- ‘first_k’: First k observations.
- ‘replace’: Random with replacement.
- ‘no_replace’: Random without replacement.
- ‘patent’: Patented method of selecting the initial centers (US 6,882,998 B1).
Defaults to patent.
- max_iter : int, optional
Max iterations. Defaults to 100.
- tol : float, optional
Convergence threshold for exiting iterations. Defaults to 1.0e-6.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
- distance_level : str, optional
Ways to compute the distance between the item and the cluster center.
- ‘manhattan’
- ‘euclidean’
- ‘minkowski’
- ‘chebyshev’
- ‘cosine’
Defaults to euclidean.
- minkowski_power : float, optional
When Minkowski distance is used, this parameter controls the value of power. Only valid when distance_level is minkowski. Defaults to 3.0.
- category_weights : float, optional
Represents the weight of category attributes. Defaults to 0.707.
- normalization : {‘no’, ‘l1_norm’, ‘min_max’}, optional
Normalization type.
- ‘no’: No, normalization will not be applied.
- ‘l1_norm’: Yes, for each point X (x1,x2,…,xn), the normalized value will be X’(x1/S,x2/S,…,xn/S), where S = |x1|+|x2|+...|xn|.
- ‘min_max’: Yes, for each column C, get the min and max value of C, and then C[i] = (C[i]-min)/(max-min).
Defaults to no.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
Examples
Input dataframe for clustering:
>>> df1.collect()
    ID  V000 V001  V002
0    0   0.5    A   0.5
1    1   1.5    A   0.5
2    2   1.5    A   1.5
3    3   0.5    A   1.5
4    4   1.1    B   1.2
5    5   0.5    B  15.5
6    6   1.5    B  15.5
7    7   1.5    B  16.5
8    8   0.5    B  16.5
9    9   1.2    C  16.1
10  10  15.5    C  15.5
11  11  16.5    C  15.5
12  12  16.5    C  16.5
13  13  15.5    C  16.5
14  14  15.6    D  16.2
15  15  15.5    D   0.5
16  16  16.5    D   0.5
17  17  16.5    D   1.5
18  18  15.5    D   1.5
19  19  15.7    A   1.6
Creating KMedoids instance:
>>> kmedoids = KMedoids(conn_context=cc, n_clusters=4, init='first_K',
...                     max_iter=100, tol=1.0E-6,
...                     distance_level='Euclidean',
...                     thread_ratio=0.3, category_weights=0.5)
Performing fit() on given dataframe:
>>> kmedoids.fit(df1, 'ID')
>>> kmedoids.cluster_centers_.collect()
   CLUSTER_ID  V000 V001  V002
0           0   1.5    A   1.5
1           1  15.5    D   1.5
2           2  15.5    C  16.5
3           3   1.5    B  16.5
Performing fit_predict() on given dataframe:
>>> kmedoids.fit_predict(df1, 'ID').collect()
    ID  CLUSTER_ID  DISTANCE
0    0           0  1.414214
1    1           0  1.000000
2    2           0  0.000000
3    3           0  1.000000
4    4           0  1.207107
5    5           3  1.414214
6    6           3  1.000000
7    7           3  0.000000
8    8           3  1.000000
9    9           3  1.207107
10  10           2  1.000000
11  11           2  1.414214
12  12           2  1.000000
13  13           2  0.000000
14  14           2  1.023335
15  15           1  1.000000
16  16           1  1.414214
17  17           1  1.000000
18  18           1  0.000000
19  19           1  0.930714
Attributes: - cluster_centers_ : DataFrame
Coordinates of cluster centers.
- labels_ : DataFrame
Cluster assignment and distance to cluster center for each point.
Methods
- fit(data, key[, features]): Perform clustering on input dataset.
- fit_predict(data, key[, features]): Perform clustering algorithm and return labels.
-
fit(data, key, features=None)
Perform clustering on input dataset.
Parameters: - data : DataFrame
DataFrame contains input data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
-
fit_predict(data, key, features=None)
Perform clustering algorithm and return labels.
Parameters: - data : DataFrame
DataFrame containing input data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Fit result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- CLUSTER_ID, type INTEGER, cluster ID assigned to the data point.
- DISTANCE, type DOUBLE, the distance between the given point and the cluster center.
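As a quick sanity check on a fitted model, the cluster sizes can be counted on the client side from labels_; a sketch assuming the kmedoids instance fitted above and that labels_ carries the same ID, CLUSTER_ID and DISTANCE columns as the fit_predict() result:
>>> sizes = kmedoids.labels_.collect()['CLUSTER_ID'].value_counts()
>>> sizes.sort_index()    # number of points assigned to each cluster ID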
hana_ml.algorithms.pal.decomposition¶
This module contains PAL wrappers for decomposition algorithms.
The following classes are available:
-
class hana_ml.algorithms.pal.decomposition.LatentDirichletAllocation(conn_context, n_components, doc_topic_prior=None, topic_word_prior=None, burn_in=None, iteration=None, thin=None, seed=None, max_top_words=None, threshold_top_words=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Latent Dirichlet allocation (LDA) is a generative model in which each item (word) of a collection (document) is generated from a finite mixture over several latent groups (topics).
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- n_components : int
Expected number of topics in the corpus.
- doc_topic_prior : float, optional
Specifies the prior weight related to document-topic distribution. Defaults to 50/n_components.
- topic_word_prior : float, optional
Specifies the prior weight related to topic-word distribution. Defaults to 0.1.
- burn_in : int, optional
Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded. Defaults to 0.
- iteration : int, optional
Number of Gibbs iterations. Defaults to 2000.
- thin : int, optional
Number of omitted in-between Gibbs iterations. Value must be greater than 0. Defaults to 1.
- seed : int, optional
Indicates the seed used to initialize the random number generator:
- 0: Uses the system time.
- Not 0: Uses the provided value.
Defaults to 0.
- max_top_words : int, optional
Specifies the maximum number of words to be output for each topic. Defaults to 0.
- threshold_top_words : float, optional
The algorithm outputs top words for each topic if the probability is larger than this threshold. It cannot be used together with parameter max_top_words.
- gibbs_init : str, optional
Specifies initialization method for Gibbs sampling:
- ‘uniform’: Assign each word in each document a topic by uniform distribution.
- ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to ‘uniform’.
- delimiters : list of str, optional
Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long. Defaults to [‘ ‘].
- output_word_assignment : bool, optional
Controls whether to output the word_topic_assignment_ or not. If True, output the word_topic_assignment_. Defaults to False.
Notes
- Parameters max_top_words and threshold_top_words cannot be used together.
- Parameters burn_in, thin, iteration, seed, gibbs_init and delimiters set in transform() will take precedence over the corresponding ones in __init__().
Examples
Input dataframe for training:
>>> df1.collect()
   DOCUMENT_ID                                               TEXT
0           10  cpu harddisk graphiccard cpu monitor keyboard ...
1           20  tires mountainbike wheels valve helmet mountai...
2           30  carseat toy strollers toy toy spoon toy stroll...
3           40  sweaters sweaters sweaters boots sweaters ring...
Creating LDA instance:
>>> lda = LatentDirichletAllocation(cc, n_components=6, burn_in=50, thin=10,
...                                 iteration=100, seed=1,
...                                 max_top_words=5, doc_topic_prior=0.1,
...                                 output_word_assignment=True,
...                                 delimiters=[' ', '\r', '\n'])
Performing fit() on given dataframe:
>>> lda.fit(df1, 'DOCUMENT_ID', 'TEXT')
>>> lda.doc_topic_dist_.collect()
    DOCUMENT_ID  TOPIC_ID  PROBABILITY
0            10         0     0.010417
1            10         1     0.010417
2            10         2     0.010417
3            10         3     0.010417
4            10         4     0.947917
5            10         5     0.010417
6            20         0     0.009434
7            20         1     0.009434
8            20         2     0.009434
9            20         3     0.952830
10           20         4     0.009434
11           20         5     0.009434
12           30         0     0.103774
13           30         1     0.858491
14           30         2     0.009434
15           30         3     0.009434
16           30         4     0.009434
17           30         5     0.009434
18           40         0     0.009434
19           40         1     0.009434
20           40         2     0.952830
21           40         3     0.009434
22           40         4     0.009434
23           40         5     0.009434
>>> lda.word_topic_assignment_.collect()
    DOCUMENT_ID  WORD_ID  TOPIC_ID
0            10        0         4
1            10        1         4
2            10        2         4
3            10        0         4
4            10        3         4
5            10        4         4
6            10        0         4
7            10        5         4
8            10        5         4
9            20        6         3
10           20        7         3
11           20        8         3
12           20        9         3
13           20       10         3
14           20        7         3
15           20       11         3
16           20        6         3
17           20        7         3
18           20        7         3
19           30       12         1
20           30       13         1
21           30       14         1
22           30       13         1
23           30       13         1
24           30       15         0
25           30       13         1
26           30       14         1
27           30       13         1
28           30       12         1
29           40       16         2
30           40       16         2
31           40       16         2
32           40       17         2
33           40       16         2
34           40       18         2
35           40       19         2
36           40       19         2
37           40       20         2
38           40       16         2
>>> lda.topic_top_words_.collect()
   TOPIC_ID                                        WORDS
0         0      spoon strollers tires graphiccard valve
1         1        toy strollers carseat graphiccard cpu
2         2               sweaters vest shoe rings boots
3         3  mountainbike tires rearfender helmet valve
4         4    cpu memory graphiccard keyboard harddisk
5         5        strollers tires graphiccard cpu valve
>>> lda.topic_word_dist_.head(40).collect()
    TOPIC_ID  WORD_ID  PROBABILITY
0          0        0     0.050000
1          0        1     0.050000
2          0        2     0.050000
3          0        3     0.050000
4          0        4     0.050000
5          0        5     0.050000
6          0        6     0.050000
7          0        7     0.050000
8          0        8     0.550000
9          0        9     0.050000
10         1        0     0.050000
11         1        1     0.050000
12         1        2     0.050000
13         1        3     0.050000
14         1        4     0.050000
15         1        5     0.050000
16         1        6     0.050000
17         1        7     0.050000
18         1        8     0.050000
19         1        9     0.550000
20         2        0     0.025000
21         2        1     0.025000
22         2        2     0.525000
23         2        3     0.025000
24         2        4     0.025000
25         2        5     0.025000
26         2        6     0.025000
27         2        7     0.275000
28         2        8     0.025000
29         2        9     0.025000
30         3        0     0.014286
31         3        1     0.014286
32         3        2     0.014286
33         3        3     0.585714
34         3        4     0.157143
35         3        5     0.014286
36         3        6     0.157143
37         3        7     0.014286
38         3        8     0.014286
39         3        9     0.014286
>>> lda.dictionary_.collect()
    WORD_ID          WORD
0        17         boots
1        12       carseat
2         0           cpu
3         2   graphiccard
4         1      harddisk
5        10        helmet
6         4      keyboard
7         5        memory
8         3       monitor
9         7  mountainbike
10       11    rearfender
11       18         rings
12       20          shoe
13       15         spoon
14       14     strollers
15       16      sweaters
16        6         tires
17       13           toy
18        9         valve
19       19          vest
20        8        wheels
>>> lda.statistic_.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   4
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -64.95765414596762
Dataframe to transform:
>>> df2.collect()
   DOCUMENT_ID               TEXT
0           10  toy toy spoon cpu
Performing transform on the given dataframe:
>>> res = lda.transform(df2, 'DOCUMENT_ID', 'TEXT', burn_in=2000, thin=100,
...                     iteration=1000, seed=1,
...                     output_word_assignment=True)
>>> doc_top_df, word_top_df, stat_df = res
>>> doc_top_df.collect()
   DOCUMENT_ID  TOPIC_ID  PROBABILITY
0           10         0     0.239130
1           10         1     0.456522
2           10         2     0.021739
3           10         3     0.021739
4           10         4     0.239130
5           10         5     0.021739
>>> word_top_df.collect()
   DOCUMENT_ID  WORD_ID  TOPIC_ID
0           10       13         1
1           10       13         1
2           10       15         0
3           10        0         4
>>> stat_df.collect()
         STAT_NAME          STAT_VALUE
0        DOCUMENTS                   1
1  VOCABULARY_SIZE                  21
2   LOG_LIKELIHOOD  -7.925092991875363
3       PERPLEXITY   7.251970666272191
Attributes: - doc_topic_dist_ : DataFrame
- DOCUMENT_TOPIC_DISTRIBUTION table, structured as follows:
- Document ID column, with same name and type as data’s document ID column from fit().
- TOPIC_ID, type INTEGER, topic ID.
- PROBABILITY, type DOUBLE, probability of topic given document.
- word_topic_assignment_ : DataFrame
- WORD_TOPIC_ASSIGNMENT table, structured as follows:
- Document ID column, with same name and type as data’s document ID column from fit().
- WORD_ID, type INTEGER, word ID.
- TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is set to False.
- topic_top_words_ : DataFrame
- TOPIC_TOP_WORDS table, structured as follows:
- TOPIC_ID, type INTEGER, topic ID.
- WORDS, type NVARCHAR(5000), topic top words separated by spaces.
Set to None if neither max_top_words nor threshold_top_words is provided.
- topic_word_dist_ : DataFrame
- TOPIC_WORD_DISTRIBUTION table, structured as follows:
- TOPIC_ID, type INTEGER, topic ID.
- WORD_ID, type INTEGER, word ID.
- PROBABILITY, type DOUBLE, probability of word given topic.
- dictionary_ : DataFrame
- DICTIONARY table, structured as follows:
- WORD_ID, type INTEGER, word ID.
- WORD, type NVARCHAR(5000), word text.
- statistic_ : DataFrame
- STATISTICS table, structured as follows:
- STAT_NAME, type NVARCHAR(256), statistic name.
- STAT_VALUE, type NVARCHAR(1000), statistic value.
Methods
- fit(data, key[, document]): Fit LDA model based on training data.
- fit_transform(data, key[, document]): Fit LDA model based on training data and return the topic assignment for the training documents.
- transform(data, key[, document, burn_in, …]): Transform the topic assignment for new documents based on the previous LDA estimation results.
-
fit(data, key, document=None)
Fit LDA model based on training data.
Parameters: - data : DataFrame
Training data.
- key : str
Name of the document ID column.
- document : str, optional
Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
-
fit_transform(data, key, document=None)
Fit LDA model based on training data and return the topic assignment for the training documents.
Parameters: - data : DataFrame
Training data.
- key : str
Name of the document ID column.
- document : str, optional
Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
Returns: - doc_topic_df : DataFrame
- DOCUMENT_TOPIC_DISTRIBUTION table, structured as follows:
- Document ID column, with same name and type as data’s document ID column.
- TOPIC_ID, type INTEGER, topic ID.
- PROBABILITY, type DOUBLE, probability of topic given document.
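A minimal usage sketch, assuming the lda instance and the training dataframe df1 from the example above; fit_transform() fits the model and returns the document-topic distribution in a single call:
>>> doc_topics = lda.fit_transform(df1, 'DOCUMENT_ID', 'TEXT')
>>> doc_topics.collect()    # DOCUMENT_ID, TOPIC_ID, PROBABILITY, one row per document/topic pair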
-
transform(data, key, document=None, burn_in=None, iteration=None, thin=None, seed=None, gibbs_init=None, delimiters=None, output_word_assignment=None)
Transform the topic assignment for new documents based on the previous LDA estimation results.
Parameters: - data : DataFrame
Independent variable values used for transform.
- key : str
Name of the document ID column.
- document : str, optional
Name of the document text column. If document is not provided, data must have exactly 1 non-key column, and document defaults to that column.
- burn_in : int, optional
Number of omitted Gibbs iterations at the beginning. Generally, samples from the beginning may not accurately represent the desired distribution and are usually discarded. Defaults to 0 if not set in __init__().
- iteration : int, optional
Number of Gibbs iterations. Defaults to 2000 if not set in __init__().
- thin : int, optional
Number of omitted in-between Gibbs iterations. Defaults to 1 if not set in __init__().
- seed : int, optional
Indicates the seed used to initialize the random number generator:
- 0: Uses the system time.
- Not 0: Uses the provided value.
Defaults to 0 if not set in __init__().
- gibbs_init : str, optional
Specifies initialization method for Gibbs sampling:
- ‘uniform’: Assign each word in each document a topic by uniform distribution.
- ‘gibbs’: Assign each word in each document a topic by one round of Gibbs sampling using doc_topic_prior and topic_word_prior.
Defaults to ‘uniform’ if not set in __init__().
- delimiters : list of str, optional
Specifies the set of delimiters to separate words in a document. Each delimiter must be one character long. Defaults to [‘ ‘] if not set in __init__().
- output_word_assignment : bool, optional
Controls whether to output the word_topic_df or not. If True, output the word_topic_df. Defaults to False.
Returns: - doc_topic_df : DataFrame
- DOCUMENT_TOPIC_DISTRIBUTION table, structured as follows:
- Document ID column, with same name and type as data’s document ID column.
- TOPIC_ID, type INTEGER, topic ID.
- PROBABILITY, type DOUBLE, probability of topic given document.
- word_topic_df : DataFrame
- WORD_TOPIC_ASSIGNMENT table, structured as follows:
- Document ID column, with same name and type as data’s document ID column.
- WORD_ID, type INTEGER, word ID.
- TOPIC_ID, type INTEGER, topic ID.
Set to None if output_word_assignment is False.
- stat_df : DataFrame
- STATISTICS table, structured as follows:
- STAT_NAME, type NVARCHAR(256), statistic name.
- STAT_VALUE, type NVARCHAR(1000), statistic value.
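Since the word-topic assignment reports WORD_ID rather than the word text, it can be joined with dictionary_ for readability. A client-side sketch using pandas, assuming the lda instance and the word_top_df returned by the transform() example above (with output_word_assignment=True):
>>> words = lda.dictionary_.collect()          # WORD_ID, WORD
>>> assigned = word_top_df.collect()           # DOCUMENT_ID, WORD_ID, TOPIC_ID
>>> assigned.merge(words, on='WORD_ID')        # adds the word text to each assignment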
-
class hana_ml.algorithms.pal.decomposition.PCA(conn_context, scaling=None, thread_ratio=None, scores=None)
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Principal component analysis procedure to reduce the dimensionality of multivariate data using Singular Value Decomposition.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to heuristically determined.
- scaling : bool, optional
If true, scale variables to have unit variance before the analysis takes place. Defaults to False.
- scores : bool, optional
If true, output the scores on each principal component when fitting. Defaults to False.
Notes
Variables cannot be scaled if any variable has a constant value across all data items.
Examples
Input dataframe for training:
>>> df1.head(4).collect()
   ID    X1    X2    X3    X4
0   1  12.0  52.0  20.0  44.0
1   2  12.0  57.0  25.0  45.0
2   3  12.0  54.0  21.0  45.0
3   4  13.0  52.0  21.0  46.0
Creating PCA instance:
>>> pca = PCA(cc, scaling=True, thread_ratio=0.5, scores=True)
Performing fit() on given dataframe:
>>> pca.fit(df1, key='ID')
>>> pca.loadings_.collect()
  COMPONENT_ID  LOADINGS_X1  LOADINGS_X2  LOADINGS_X3  LOADINGS_X4
0        Comp1     0.541547     0.321424     0.511941     0.584235
1        Comp2    -0.454280     0.728287     0.395819    -0.326429
2        Comp3    -0.171426    -0.600095     0.760875    -0.177673
3        Comp4    -0.686273    -0.078552    -0.048095     0.721489
>>> pca.loadings_stat_.collect()
  COMPONENT_ID        SD  VAR_PROP  CUM_VAR_PROP
0        Comp1  1.566624  0.613577      0.613577
1        Comp2  1.100453  0.302749      0.916327
2        Comp3  0.536973  0.072085      0.988412
3        Comp4  0.215297  0.011588      1.000000
>>> pca.scaling_stat_.collect()
   VARIABLE_ID       MEAN     SCALE
0            1  17.000000  5.039841
1            2  53.636364  1.689540
2            3  23.000000  2.000000
3            4  48.454545  4.655398
Input dataframe for transforming:
>>> df2.collect()
   ID    X1    X2    X3    X4
0   1   2.0  32.0  10.0  54.0
1   2   9.0  57.0  20.0  25.0
2   3  12.0  24.0  28.0  35.0
3   4  15.0  42.0  27.0  36.0
Performing transform() on given dataframe:
>>> result = pca.transform(df2, key='ID', n_components=4)
>>> result.collect()
   ID  COMPONENT_1  COMPONENT_2  COMPONENT_3  COMPONENT_4
0   1    -8.359662   -10.936083     3.037744     4.220525
1   2    -3.931082     3.221886    -1.168764    -2.629849
2   3    -6.584040   -10.391291    13.112075    -0.146681
3   4    -2.967768    -3.170720     6.198141    -1.213035
Attributes: - loadings_ : DataFrame
The weights by which each standardized original variable should be multiplied when computing component scores.
- loadings_stat_ : DataFrame
Loadings statistics on each component.
- scores_ : DataFrame
The transformed variable values corresponding to each data point. Set to None if scores is False.
- scaling_stat_ : DataFrame
Mean and scale values of each variable.
Methods
- fit(data, key[, features]): Principal component analysis function.
- fit_transform(data, key[, features]): Fit with the dataset and return the scores.
- transform(data, key[, features, n_components]): Principal component analysis projection function using a trained model.
-
fit(data, key, features=None)
Principal component analysis function.
Parameters: - data : DataFrame
Data to be analyzed.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
-
fit_transform(data, key, features=None)
Fit with the dataset and return the scores.
Parameters: - data : DataFrame
Data to be analyzed.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
Returns: - DataFrame
Transformed variable values corresponding to each data point, structured as follows:
- ID column, with same name and type as data’s ID column.
- Score columns, type DOUBLE, representing the component score values of each data point.
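A minimal sketch of fit_transform(), assuming the pca instance and the training dataframe df1 from the example above; it fits the model and returns the component scores in one call:
>>> scores = pca.fit_transform(df1, key='ID')
>>> scores.head(4).collect()    # ID plus one score column per component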
-
transform(data, key, features=None, n_components=None)
Principal component analysis projection function using a trained model.
Parameters: - data : DataFrame
Data to be analyzed.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- n_components : int, optional
Number of components to be retained. The value range is from 1 to number of features. Defaults to number of features.
Returns: - DataFrame
Transformed variable values corresponding to each data point, structured as follows:
- ID column, with same name and type as data’s ID column.
- Score columns, type DOUBLE, representing the component score values of each data point.
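One way to choose n_components is to inspect the cumulative explained variance in loadings_stat_. A sketch assuming the fitted pca instance and df2 from the example above, with an illustrative 90% variance target:
>>> stats = pca.loadings_stat_.collect()
>>> n_comp = int((stats['CUM_VAR_PROP'] < 0.9).sum()) + 1   # smallest number of components reaching 90%
>>> reduced = pca.transform(df2, key='ID', n_components=n_comp)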
hana_ml.algorithms.pal.linear_model¶
This module contains PAL wrapper and helper functions for linear model algorithms. The following classes are available:
-
class hana_ml.algorithms.pal.linear_model.LinearRegression(conn_context, solver=None, var_select=None, intercept=True, alpha_to_enter=None, alpha_to_remove=None, enet_lambda=None, enet_alpha=None, max_iter=None, tol=None, pho=None, stat_inf=False, adjusted_r2=False, dw_test=False, reset_test=None, bp_test=False, ks_test=False, thread_ratio=None, categorical_variable=None, pmml_export=None)
Bases: hana_ml.algorithms.pal.pal_base.PALBase
A linear regression model, based on PAL_LINEAR_REGRESSION.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- solver : {‘QR’, ‘SVD’, ‘CD’, ‘Cholesky’, ‘ADMM’}, optional
Algorithms to use to solve the least square problem. Case-insensitive.
- ‘QR’: QR decomposition.
- ‘SVD’: singular value decomposition.
- ‘CD’: cyclical coordinate descent method.
- ‘Cholesky’: Cholesky decomposition.
- ‘ADMM’: alternating direction method of multipliers.
‘CD’ and ‘ADMM’ are supported only when var_select is ‘all’. Defaults to QR decomposition.
- var_select : {‘all’, ‘forward’, ‘backward’}
Method to perform variable selection.
- ‘all’: all variables are included.
- ‘forward’: forward selection.
- ‘backward’: backward selection.
‘forward’ and ‘backward’ selection are supported only when solver is ‘QR’, ‘SVD’ or ‘Cholesky’. Defaults to ‘all’.
- intercept : bool, optional
If true, include the intercept in the model. Defaults to True.
- alpha_to_enter : float, optional
P-value for forward selection. Valid only when var_select is ‘forward’. Defaults to 0.05.
- alpha_to_remove : float, optional
P-value for backward selection. Valid only when var_select is ‘backward’. Defaults to 0.1.
- enet_lambda : float, optional
Penalized weight. Value should be greater than or equal to 0. Valid only when solver is ‘CD’ or ‘ADMM’.
- enet_alpha : float, optional
Elastic net mixing parameter. Ranges from 0 (Ridge penalty) to 1 (LASSO penalty) inclusively. Valid only when solver is ‘CD’ or ‘ADMM’. Defaults to 1.0.
- max_iter : int, optional
Maximum number of passes over training data. If convergence is not reached after the specified number of iterations, an error will be generated. Valid only when solver is ‘CD’ or ‘ADMM’. Defaults to 1e5.
- tol : float, optional
Convergence threshold for coordinate descent. Valid only when solver is ‘CD’. Defaults to 1.0e-7.
- pho : float, optional
Step size for ADMM. Generally, it should be greater than 1. Valid only when solver is ‘ADMM’. Defaults to 1.8.
- stat_inf : bool, optional
If true, output t-value and Pr(>|t|) of coefficients. Defaults to False.
- adjusted_r2 : bool, optional
If true, include the adjusted R^2 value in statistics. Defaults to False.
- dw_test : bool, optional
If true, conduct Durbin-Watson test under null hypothesis that errors do not follow a first order autoregressive process. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to False.
- reset_test : int, optional
Specifies the order of Ramsey RESET test. Ramsey RESET test with power of variables ranging from 2 to this value (greater than 1) will be conducted. Value 1 means RESET test will not be conducted. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to 1.
- bp_test : bool, optional
If true, conduct Breusch-Pagan test under null hypothesis that homoscedasticity is satisfied. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to False.
- ks_test : bool, optional
If true, conduct Kolmogorov-Smirnov normality test under null hypothesis that errors follow a normal distribution. Not available if elastic net regularization is enabled or intercept is ignored. Defaults to False.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 means only using 1 thread, and 1 means using at most all the currently available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Valid only when solver is ‘QR’, ‘CD’, ‘Cholesky’ or ‘ADMM’. Defaults to 0.0.
- categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
- pmml_export : {‘no’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
- ‘no’ or not provided: No PMML model.
- ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Prediction does not require a PMML model.
Examples
Training data:
>>> df.collect()
   ID       Y    X1 X2  X3
0   0  -6.879  0.00  A   1
1   1  -3.449  0.50  A   1
2   2   6.635  0.54  B   1
3   3  11.844  1.04  B   1
4   4   2.786  1.50  A   1
5   5   2.389  0.04  B   2
6   6  -0.011  2.00  A   2
7   7   8.839  2.04  B   2
8   8   4.689  1.54  B   1
9   9  -5.507  1.00  A   2
Training the model:
>>> lr = LinearRegression(cc,
...                       thread_ratio=0.5,
...                       categorical_variable=["X3"])
>>> lr.fit(df, key='ID', label='Y')
Prediction:
>>> df2.collect()
   ID     X1 X2  X3
0   0  1.690  B   1
1   1  0.054  B   2
2   2  0.123  A   2
3   3  1.980  A   1
4   4  0.563  A   1
>>> lr.predict(df2, key='ID').collect()
   ID      VALUE
0   0  10.314760
1   1   1.685926
2   2  -7.409561
3   3   2.021592
4   4  -3.122685
Attributes: - coefficients_ : DataFrame
Fitted regression coefficients.
- pmml_ : DataFrame
PMML model. Set to None if no PMML model was requested.
- fitted_ : DataFrame
Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
- statistics_ : DataFrame
Regression-related statistics, such as mean squared error.
Methods
- fit(data[, key, features, label]): Fit regression model based on training data.
- predict(data, key[, features]): Predict dependent variable values based on fitted model.
- score(data, key[, features, label]): Returns the coefficient of determination R^2 of the prediction.
-
fit(data, key=None, features=None, label=None)
Fit regression model based on training data.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. If label is not provided, it defaults to the last column.
-
predict(data, key, features=None)
Predict dependent variable values based on fitted model.
Parameters: - data : DataFrame
Independent variable values to predict for.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
Returns: - DataFrame
- Predicted values, structured as follows:
- ID column, with same name and type as data’s ID column.
- VALUE, type DOUBLE, representing predicted values.
Notes
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
-
score(data, key, features=None, label=None)
Returns the coefficient of determination R^2 of the prediction.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. If label is not provided, it defaults to the last column.
Returns: - accuracy : float
Returns the coefficient of determination R^2 of the prediction.
Notes
score() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
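A minimal sketch of score(), assuming the lr model and the training dataframe df from the example above (in practice a held-out dataframe with the same columns would normally be used):
>>> r2 = lr.score(df, key='ID', label='Y')
>>> r2    # coefficient of determination; closer to 1 indicates a better fit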
-
class hana_ml.algorithms.pal.linear_model.LogisticRegression(conn_context, multi_class=False, max_iter=None, pmml_export=None, categorical_variable=None, standardize=True, stat_inf=False, solver=None, alpha=None, lamb=None, tol=None, epsilon=None, thread_ratio=None, max_pass_number=None, sgd_batch_number=None, lbfgs_m=None, class_map0=None, class_map1=None)
Bases: hana_ml.algorithms.pal.pal_base.PALBase
Logistic regression model that handles binary-class and multi-class classification problems.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- multi_class : bool, optional
If true, perform multi-class classification. Otherwise, there must be only two classes. Defaults to False.
- max_iter : int, optional
Maximum number of iterations taken for the solvers to converge. If convergence is not reached after this number, an error will be generated.
- multi-class: Defaults to 100.
- binary-class: Defaults to 100000 when solver is cyclical, 1000 when solver is proximal, otherwise 100.
- pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
- multi-class:
- ‘no’ or not provided: No PMML model.
- ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
- binary-class:
- ‘no’ or not provided: No PMML model.
- ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
- ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Defaults to ‘no’.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- standardize : bool, optional
If true, standardize the data to have zero mean and unit variance. Defaults to True.
- stat_inf : bool, optional
If true, proceed with statistical inference. Defaults to False.
- solver : {‘newton’, ‘cyclical’, ‘lbfgs’, ‘stochastic’, ‘proximal’}, optional
Optimization algorithm.
- ‘newton’: Newton iteration method.
- ‘cyclical’: Cyclical coordinate descent method to fit elastic net regularized logistic regression.
- ‘lbfgs’: LBFGS method (recommended when having many independent variables).
- ‘stochastic’: Stochastic gradient descent method (recommended when dealing with very large dataset).
- ‘proximal’: Proximal gradient descent method to fit elastic net regularized logistic regression.
Only valid when multi_class is False. Defaults to newton.
- alpha : float, optional
Elastic net mixing parameter. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal. Defaults to 1.0.
- lamb : float, optional
Penalized weight. Only valid when multi_class is False and solver is newton, cyclical, lbfgs or proximal. Defaults to 0.0.
- tol : float, optional
Convergence threshold for exiting iterations. Only valid when multi_class is False. Defaults to 1.0e-7 when solver is cyclical, 1.0e-6 otherwise.
- epsilon : float, optional
Determines the accuracy with which the solution is to be found. Only valid when multi_class is False and the solver is newton or lbfgs. Defaults to 1.0e-6 when solver is newton, 1.0e-5 when solver is lbfgs.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. thread_ratio cannot be set separately for fit(), predict() and score(). Only valid when multi_class is False. Defaults to 1.0 for fit(), 0.0 for predict() and score().
- max_pass_number : int, optional
The maximum number of passes over the data. Only valid when multi_class is False and solver is stochastic. Defaults to 1.
- sgd_batch_number : int, optional
The batch number of Stochastic gradient descent. Only valid when multi_class is False and solver is stochastic. Defaults to 1.
- lbfgs_m : int, optional
Number of previous updates to keep. Only applicable when multi_class is False and solver is lbfgs. Defaults to 6.
- class_map0 : str, optional
Categorical label to map to 0. Only valid when multi_class is False. class_map0 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
- class_map1 : str, optional
Categorical label to map to 1. Only valid when multi_class is False. class_map1 is mandatory when label column type is VARCHAR or NVARCHAR during binary class fit and score.
Examples
Training data:
>>> df.collect()
   V1     V2  V3  CATEGORY
0   B  2.620   0         1
1   B  2.875   0         1
2   A  2.320   1         1
3   A  3.215   2         0
4   B  3.440   3         0
5   B  3.460   0         0
6   A  3.570   1         0
7   B  3.190   2         0
8   A  3.150   3         0
9   B  3.440   0         0
10  B  3.440   1         0
11  A  4.070   3         0
12  A  3.730   1         0
13  B  3.780   2         0
14  B  5.250   2         0
15  A  5.424   3         0
16  A  5.345   0         0
17  B  2.200   1         1
18  B  1.615   2         1
19  A  1.835   0         1
20  B  2.465   3         0
21  A  3.520   1         0
22  A  3.435   0         0
23  B  3.840   2         0
24  B  3.845   3         0
25  A  1.935   1         1
26  B  2.140   0         1
27  B  1.513   1         1
28  A  3.170   3         1
29  B  2.770   0         1
30  B  3.570   0         1
31  A  2.780   3         1
Create LogisticRegression instance and call fit:
>>> lr = linear_model.LogisticRegression(cc, solver='newton',
...                                      thread_ratio=0.1, max_iter=1000,
...                                      categorical_variable=['V3'],
...                                      pmml_export='single-row',
...                                      stat_inf=True, tol=0.000001)
>>> lr.fit(df, features=['V1', 'V2', 'V3'], label='CATEGORY')
>>> lr.coef_.collect()
        VARIABLE_NAME  COEFFICIENT
0   __PAL_INTERCEPT__    15.579882
1  V1__PAL_DELIMIT__B     0.000000
2  V1__PAL_DELIMIT__A     1.464903
3                  V2    -4.819740
4  V3__PAL_DELIMIT__0     0.000000
5  V3__PAL_DELIMIT__1    -2.794139
6  V3__PAL_DELIMIT__2    -4.807858
7  V3__PAL_DELIMIT__3    -2.780918
>>> pred_df = cc.table('DATA_TBL_PREDICT')
>>> pred_df.collect()
    ID V1     V2  V3
0    0  B  2.620   0
1    1  B  2.875   0
2    2  A  2.320   1
3    3  A  3.215   2
4    4  B  3.440   3
5    5  B  3.460   0
6    6  A  3.570   1
7    7  B  3.190   2
8    8  A  3.150   3
9    9  B  3.440   0
10  10  B  3.440   1
11  11  A  4.070   3
12  12  A  3.730   1
13  13  B  3.780   2
14  14  B  5.250   2
15  15  A  5.424   3
16  16  A  5.345   0
17  17  B  2.200   1
Call predict:
>>> result = lr.predict(pred_df, 'ID', ['V1', 'V2', 'V3'])
>>> result.collect()
    ID CLASS   PROBABILITY
0    0     1  9.503656e-01
1    1     1  8.485314e-01
2    2     1  9.555893e-01
3    3     0  3.702131e-02
4    4     0  2.229288e-02
5    5     0  2.504115e-01
6    6     0  4.946187e-02
7    7     0  9.922804e-03
8    8     0  2.853014e-01
9    9     0  2.689367e-01
10  10     0  2.200654e-02
11  11     0  4.714084e-03
12  12     0  2.349977e-02
13  13     0  5.830852e-04
14  14     0  4.886534e-07
15  15     0  6.938601e-06
16  16     0  1.637959e-04
17  17     1  8.986501e-01
Attributes: - coef_ : DataFrame
Values of the coefficients.
- result_ : DataFrame
Model content.
- pmml_ : DataFrame
PMML model. Set to None if no PMML model was requested.
Methods
- fit(data[, key, features, label]): Fit the LR model when given the training dataset.
- predict(data, key[, features, verbose]): Predict with the dataset using the trained model.
- score(data, key[, features, label]): Return the mean accuracy on the given test data and labels.
-
fit(data, key=None, features=None, label=None)
Fit the LR model when given the training dataset.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
-
predict(data, key, features=None, verbose=False)
Predict with the dataset using the trained model.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- verbose : bool, optional
If true, output scoring probabilities for each class. It is only applicable for multi-class case. Defaults to False.
Returns: - DataFrame
- Predicted result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- CLASS, type NVARCHAR, predicted class name.
- PROBABILITY, type DOUBLE
- multi-class: probability of being predicted as the predicted class.
- binary-class: probability of being predicted as the positive class.
Notes
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the result_ table otherwise.
-
score(data, key, features=None, label=None)
Return the mean accuracy on the given test data and labels.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
Returns: - accuracy : float
Scalar accuracy value after comparing the predicted label and original label.
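For illustration, a minimal usage sketch of score() (not part of the original example set), assuming a hypothetical labeled test DataFrame df_test with the same ID, V1, V2, V3 and CATEGORY columns as the training data:

>>> accuracy = lr.score(df_test, key='ID',
...                     features=['V1', 'V2', 'V3'],
...                     label='CATEGORY')
>>> accuracy  # fraction of rows whose predicted class matches CATEGORY, between 0 and 1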
hana_ml.algorithms.pal.metrics¶
This module contains PAL wrappers for metrics to assess the quality of model outputs.
The following functions are available:
-
hana_ml.algorithms.pal.metrics.
accuracy_score
(conn_context, data, label_true, label_pred)¶ Compute mean accuracy score for classification results. That is, the proportion of the correctly predicted results among the total number of cases examined.
Parameters: - conn_context : ConnectionContext
HANA connection.
- data : DataFrame
DataFrame of true and predicted labels.
- label_true : str
Name of the column containing ground truth labels.
- label_pred : str
Name of the column containing predicted labels, as returned by a classifier.
Returns: - accuracy : float
Accuracy classification score. A lower accuracy indicates that the classifier correctly predicted fewer of the labels in the input.
Examples
Actual and predicted labels for a hypothetical classification:
>>> df.collect() ACTUAL PREDICTED 0 1 0 1 0 0 2 0 0 3 1 1 4 1 1
Accuracy score for these predictions:
>>> accuracy_score(cc, df, label_true='ACTUAL', label_pred='PREDICTED') 0.8
Compare that to null accuracy (accuracy that could be achieved by always predicting the most frequent class):
>>> df_dummy.collect() ACTUAL PREDICTED 0 1 1 1 0 1 2 0 1 3 1 1 4 1 1 >>> accuracy_score(cc, df_dummy, label_true='ACTUAL', label_pred='PREDICTED') 0.6
A perfect predictor:
>>> df_perfect.collect() ACTUAL PREDICTED 0 1 1 1 0 0 2 0 0 3 1 1 4 1 1 >>> accuracy_score(cc, df_perfect, label_true='ACTUAL', label_pred='PREDICTED') 1.0
-
hana_ml.algorithms.pal.metrics.
auc
(conn_context, data, positive_label=None)¶ Compute area under curve (AUC) to evaluate the performance of binary-class classification algorithms.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data : DataFrame
Input data, structured as follows: - ID column. - True class of the data point. - Classifier-computed probability that the data point belongs to the positive class.
- positive_label : str, optional
If original label is not 0 or 1, specifies the label value which will be mapped to 1.
Returns: - auc : float
The area under the receiver operating characteristic curve.
- roc : DataFrame
False positive rate and true positive rate, structured as follows: - ID column, type INTEGER. - FPR, type DOUBLE, representing false positive rate. - TPR, type DOUBLE, representing true positive rate.
Examples
Input data:
>>> df.collect() ID ORIGINAL PREDICT 0 1 0 0.07 1 2 0 0.01 2 3 0 0.85 3 4 0 0.30 4 5 0 0.50 5 6 1 0.50 6 7 1 0.20 7 8 1 0.80 8 9 1 0.20 9 10 1 0.95
Compute Area Under Curve:
>>> auc, roc = auc(cc, df)
Ideal output:
>>> print(auc) 0.66
>>> roc.collect() ID FPR TPR 0 0 1.0 1.0 1 1 0.8 1.0 2 2 0.6 1.0 3 3 0.6 0.6 4 4 0.4 0.6 5 5 0.2 0.4 6 6 0.2 0.2 7 7 0.0 0.2 8 8 0.0 0.0
-
hana_ml.algorithms.pal.metrics.
confusion_matrix
(conn_context, data, key, label_true=None, label_pred=None, beta=None, native=True)¶ Compute confusion matrix to evaluate the accuracy of a classification.
Parameters: - conn_context : ConnectionContext
Database connection object.
- data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- label_true : str, optional
Name of the original label column. If not given, defaults to the second column.
- label_pred : str, optional
Name of the predicted label column. If not given, defaults to the third column.
- beta : float, optional
Parameter used to compute the F-Beta score. Default value: 1
- native : bool, optional
Indicates whether to use native SQL statements for confusion matrix calculation. Default value: True
Returns: - confusion_matrix_df : DataFrame
- Confusion matrix, structured as follows:
- Original label, with same name and data type as it is in data.
- Predicted label, with same name and data type as it is in data.
- Count, type INTEGER, the number of data points with the corresponding combination of predicted and original label.
- The dataframe is sorted by (original label, predicted label) in descending order.
- classification_report_df : DataFrame
- Structured as follows:
- Class, type NVARCHAR(100), class name
- Recall, type DOUBLE, the recall of each class
- Precision, type DOUBLE, the precision of each class
- F_MEASURE, type DOUBLE, the F_measure of each class
- SUPPORT, type INTEGER, the support, i.e. the number of samples in each class
Examples
The data contains the original label and the predicted label:
>>> df.collect() ID ORIGINAL PREDICT 0 1 1 1 1 2 1 1 2 3 1 1 3 4 1 2 4 5 1 1 5 6 2 2 6 7 2 1 7 8 2 2 8 9 2 2 9 10 2 2
Calculate the confusion matrix:
>>> cm, cr = confusion_matrix(connection_context, df, 'ID', 'ORIGINAL', ... 'PREDICT') >>> cm.collect() ORIGINAL PREDICT COUNT 0 1 1 4 1 1 2 1 2 2 1 1 3 2 2 4 >>> cr.collect() CLASS RECALL PRECISION F_MEASURE SUPPORT 0 1 0.8 0.8 0.8 5 1 2 0.8 0.8 0.8 5
-
hana_ml.algorithms.pal.metrics.
multiclass_auc
(conn_context, data_original, data_predict)¶ Compute area under curve (AUC) to evaluate the performance of multi-class classification algorithms.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data_original : DataFrame
True class data, structured as follows: - Data point ID column. - True class of the data point.
- data_predict : DataFrame
Predicted class data, structured as follows: - Data point ID column. - Possible class. - Classifier-computed probability that the data point belongs to that particular class.
For each data point ID, there should be one row for each possible class.
Returns: - auc : float
The area under the receiver operating characteristic curve.
- roc : DataFrame
False positive rate and true positive rate, structured as follows: - ID column, type INTEGER. - FPR, type DOUBLE, representing false positive rate. - TPR, type DOUBLE, representing true positive rate.
Examples
Input data:
>>> df_original.collect() ID ORIGINAL 0 1 1 1 2 1 2 3 1 3 4 2 4 5 2 5 6 2 6 7 3 7 8 3 8 9 3 9 10 3
>>> df_predict.collect() ID PREDICT PROB 0 1 1 0.90 1 1 2 0.05 2 1 3 0.05 3 2 1 0.80 4 2 2 0.05 5 2 3 0.15 6 3 1 0.80 7 3 2 0.10 8 3 3 0.10 9 4 1 0.10 10 4 2 0.80 11 4 3 0.10 12 5 1 0.20 13 5 2 0.70 14 5 3 0.10 15 6 1 0.05 16 6 2 0.90 17 6 3 0.05 18 7 1 0.10 19 7 2 0.10 20 7 3 0.80 21 8 1 0.00 22 8 2 0.00 23 8 3 1.00 24 9 1 0.20 25 9 2 0.10 26 9 3 0.70 27 10 1 0.20 28 10 2 0.20 29 10 3 0.60
Compute Area Under Curve:
>>> auc, roc = multiclass_auc(cc, df_original, df_predict)
Ideal output:
>>> print(auc) 1.0
>>> roc.collect() ID FPR TPR 0 0 1.00 1.0 1 1 0.90 1.0 2 2 0.65 1.0 3 3 0.25 1.0 4 4 0.20 1.0 5 5 0.00 1.0 6 6 0.00 0.9 7 7 0.00 0.7 8 8 0.00 0.3 9 9 0.00 0.1 10 10 0.00 0.0
-
hana_ml.algorithms.pal.metrics.
r2_score
(conn_context, data, label_true, label_pred)¶ Compute coefficient of determination for regression results.
Parameters: - conn_context : ConnectionContext
HANA connection.
- data : DataFrame
DataFrame of true and predicted values.
- label_true : str
Name of the column containing true values.
- label_pred : str
Name of the column containing values predicted by regression.
Returns: - r2 : float
Coefficient of determination. 1.0 indicates an exact match between true and predicted values. A lower coefficient of determination indicates that the regression was able to predict less of the variance in the input. A negative value indicates that the regression performed worse than just taking the mean of the true values and using that for every prediction.
Examples
Actual and predicted values for a hypothetical regression:
>>> df.collect() ACTUAL PREDICTED 0 0.10 0.2 1 0.90 1.0 2 2.10 1.9 3 3.05 3.0 4 4.00 3.5
R^2 score for these predictions:
>>> r2_score(cc, df, label_true='ACTUAL', label_pred='PREDICTED') 0.9685233682514102
Compare that to the score for a perfect predictor:
>>> df_perfect.collect() ACTUAL PREDICTED 0 0.10 0.10 1 0.90 0.90 2 2.10 2.10 3 3.05 3.05 4 4.00 4.00 >>> r2_score(cc, df_perfect, label_true='ACTUAL', label_pred='PREDICTED') 1.0
A naive mean predictor:
>>> df_mean.collect() ACTUAL PREDICTED 0 0.10 2.03 1 0.90 2.03 2 2.10 2.03 3 3.05 2.03 4 4.00 2.03 >>> r2_score(cc, df_mean, label_true='ACTUAL', label_pred='PREDICTED') 0.0
And a really awful predictor:
>>> df_awful.collect() ACTUAL PREDICTED 0 0.10 12345.0 1 0.90 91923.0 2 2.10 -4444.0 3 3.05 -8888.0 4 4.00 -9999.0 >>> r2_score(cc, df_awful, label_true='ACTUAL', label_pred='PREDICTED') -886477397.139857
hana_ml.algorithms.pal.mixture¶
This module includes mixture modeling algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.mixture.
GaussianMixture
(conn_context, n_components=None, seeds=None, thread_ratio=None, max_iter=None, categorical_variable=None, category_weight=None, error_tol=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Representation of a Gaussian mixture model probability distribution.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- n_components : int
Specifies the number of Gaussian distributions. Either n_components or seeds needs to be provided.
- seeds : list of int
Specifies the data (by using sequence number of the data in the data table (starting from 0)) to be used as seeds. Either n_components or seeds needs to be provided.
- thread_ratio : float, optional
Controls the proportion of available threads that can be used. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
- max_iter : int, optional
Specifies the maximum number of iterations for the EM algorithm. Default value: 100.
- categorical_variable : list of str, optional
Indicates which features should be treated as categorical. The default behavior is: string columns are treated as categorical, while integer and float columns are treated as continuous. This parameter is only valid for integer variables and is omitted otherwise; the default is detected from the input data.
- category_weight : float, optional
Represents the weight of category attributes. Defaults to 0.707.
- error_tol : float, optional
Specifies the error tolerance, which is the stop condition. Defaults to 1e-5.
Examples
Input dataframe for training:
>>> df1.collect() ID X1 X2 X3 0 0 0.10 0.10 1 1 1 0.11 0.10 1 2 2 0.10 0.11 1 3 3 0.11 0.11 1 4 4 0.12 0.11 1 5 5 0.11 0.12 1 6 6 0.12 0.12 1 7 7 0.12 0.13 1 8 8 0.13 0.12 2 9 9 0.13 0.13 2 10 10 0.13 0.14 2 11 11 0.14 0.13 2 12 12 10.10 10.10 1 13 13 10.11 10.10 1 14 14 10.10 10.11 1 15 15 10.11 10.11 1 16 16 10.11 10.12 2 17 17 10.12 10.11 2 18 18 10.12 10.12 2 19 19 10.12 10.13 2 20 20 10.13 10.12 2 21 21 10.13 10.13 2 22 22 10.13 10.14 2 23 23 10.14 10.13 2
Creating GMM instance:
>>> gmm = GaussianMixture(conn_context=cc, n_components=2, ... max_iter=500, error_tol=0.001, thread_ratio=0.5, ... categorical_variable=['X3'])
Performing fit() on given dataframe:
>>> gmm.fit(df1, key='ID') >>> gmm.labels_.head(14).collect() ID CLUSTER_ID PROBABILITY 0 0 0 1.0 1 0 1 0.0 2 1 0 1.0 3 1 1 0.0 4 2 0 1.0 5 2 1 0.0 6 3 0 1.0 7 3 1 0.0 8 4 0 1.0 9 4 1 0.0 10 5 0 1.0 11 5 1 0.0 12 6 0 1.0 13 6 1 0.0
Attributes: - model_ : DataFrame
Trained model content.
- labels_ : DataFrame
Cluster membership probabilities for each data point.
Methods
fit
(data, key[, features])Perform GMM clustering on input dataset. fit_predict
(data, key[, features])Perform GMM clustering on input dataset and return cluster membership probabilities for each data point. -
fit
(data, key, features=None)¶ Perform GMM clustering on input dataset.
Parameters: - data : DataFrame
Data to be clustered.
- key : str
Name of the ID column.
- features : list of str, optional
List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.
-
fit_predict
(data, key, features=None)¶ Perform GMM clustering on input dataset and return cluster membership probabilities for each data point.
Parameters: - data : DataFrame
Data to be clustered.
- key : str
Name of the ID column.
- features : list of str, optional
List of strings specifying feature columns. If a list of features is not given, all the columns except the ID column are taken as features.
Returns: - DataFrame
Cluster membership probabilities.
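As a usage sketch, fit_predict() combines fitting and label retrieval in a single call. Assuming the same training DataFrame df1 and parameters as in the class example above:

>>> gmm = GaussianMixture(conn_context=cc, n_components=2, max_iter=500,
...                       error_tol=0.001, thread_ratio=0.5,
...                       categorical_variable=['X3'])
>>> labels = gmm.fit_predict(df1, key='ID')
>>> labels.head(4).collect()  # same ID/CLUSTER_ID/PROBABILITY layout as gmm.labels_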
hana_ml.algorithms.pal.naive_bayes¶
This module contains wrappers for PAL naive bayes classification.
The following classes are available:
-
class
hana_ml.algorithms.pal.naive_bayes.
NaiveBayes
(conn_context, alpha=None, discretization=None, model_format=None, categorical_variable=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
A classification model based on Bayes’ theorem.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- alpha : float, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.
Defaults to 0.
- discretization : {‘no’, ‘supervised’}, optional
- Discretize continuous attributes. Case-insensitive.
- ‘no’ or not provided: disable discretization.
- ‘supervised’: use supervised discretization on all the continuous attributes.
Defaults to no.
- model_format : {‘json’, ‘pmml’}, optional
Controls whether to output the model in JSON format or PMML format. Case-insensitive.
- ‘json’ or not provided: JSON format.
- ‘pmml’: PMML format.
Defaults to json.
- categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
Notes
The Laplace value (alpha) is only stored by JSON format models. If the PMML format is chosen, you may need to set the Laplace value (alpha) again in predict() and score().
Examples
Training data:
>>> df1.collect() HomeOwner MaritalStatus AnnualIncome DefaultedBorrower 0 YES Single 125.0 NO 1 NO Married 100.0 NO 2 NO Single 70.0 NO 3 YES Married 120.0 NO 4 NO Divorced 95.0 YES 5 NO Married 60.0 NO 6 YES Divorced 220.0 NO 7 NO Single 85.0 YES 8 NO Married 75.0 NO 9 NO Single 90.0 YES
Training the model:
>>> nb = NaiveBayes(cc, alpha=1.0, model_format='pmml') >>> nb.fit(df1)
Prediction:
>>> df2.collect() ID HomeOwner MaritalStatus AnnualIncome 0 0 NO Married 120.0 1 1 YES Married 180.0 2 2 NO Single 90.0
>>> nb.predict(df2, 'ID', alpha=1.0, verbose=True) ID CLASS CONFIDENCE 0 0 NO -6.572353 1 0 YES -23.747252 2 1 NO -7.602221 3 1 YES -169.133547 4 2 NO -7.133599 5 2 YES -4.648640
Attributes: - model_ : DataFrame
Trained model content.
Methods
fit
(data[, key, features, label])Fit classification model based on training data. predict
(data, key[, features, alpha, verbose])Predict based on fitted model. score
(data, key[, features, label, alpha])Returns the mean accuracy on the given test data and labels. -
fit
(data, key=None, features=None, label=None)¶ Fit classification model based on training data.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
-
predict
(data, key, features=None, alpha=None, verbose=None)¶ Predict based on fitted model.
Parameters: - data : DataFrame
Independent variable values to predict for.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- alpha : float, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.
- verbose : bool, optional
If true, output all classes and the corresponding confidences for each data point.
Defaults to False.
Returns: - DataFrame
- Predicted result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- CLASS, type NVARCHAR, predicted class name.
- CONFIDENCE, type DOUBLE, confidence for the prediction of the sample, which is a logarithmic value of the posterior probabilities.
Notes
A non-zero Laplace value (alpha) is required if there exist discrete category values that only occur in the test set. It can be read from JSON models or from the parameter alpha in predict(). The Laplace value you set here takes precedence over the values read from JSON models.
-
score
(data, key, features=None, label=None, alpha=None)¶ Returns the mean accuracy on the given test data and labels.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
- alpha : float, optional
Laplace smoothing value. Set a positive value to enable Laplace smoothing for categorical variables and use that value as the smoothing parameter. Set value 0 to disable Laplace smoothing.
Defaults to the alpha value in the JSON model, if there is one, or 0 otherwise.
Returns: - float
Mean accuracy on the given test data and labels.
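For illustration, a minimal sketch of score() (not from the original examples), assuming a hypothetical labeled DataFrame df3 with an ID column, the feature columns HomeOwner, MaritalStatus and AnnualIncome, and the DefaultedBorrower label:

>>> accuracy = nb.score(df3, key='ID', label='DefaultedBorrower', alpha=1.0)
>>> accuracy  # mean accuracy over df3, between 0 and 1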
hana_ml.algorithms.pal.neighbors¶
This module contains PAL wrappers for the k-nearest neighbors algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.neighbors.
KNN
(conn_context, n_neighbors=None, thread_ratio=None, voting_type=None, stat_info=True, metric=None, minkowski_power=None, algorithm=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
K-Nearest Neighbor(KNN) model that handles classification problems.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- n_neighbors : int, optional
Number of nearest neighbors. Defaults to 1.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
- voting_type : {‘majority’, ‘distance-weighted’}, optional
Method used to vote for the most frequent label of the K nearest neighbors. Defaults to distance-weighted.
- stat_info : bool, optional
Controls whether to return a statistic information table containing the distance between each point in the prediction set and its k nearest neighbors in the training set. If true, the table will be returned. Defaults to True.
- metric : {‘manhattan’, ‘euclidean’, ‘minkowski’, ‘chebyshev’}, optional
Ways to compute the distance between data points. Defaults to euclidean.
- minkowski_power : float, optional
When Minkowski is used for metric, this parameter controls the value of power. Only valid when metric is Minkowski. Defaults to 3.0.
- algorithm : {‘brute-force’, ‘kd-tree’}, optional
Algorithm used to compute the nearest neighbors. Defaults to brute-force.
Examples
Training data:
>>> df.collect() ID X1 X2 TYPE 0 0 1.0 1.0 2 1 1 10.0 10.0 3 2 2 10.0 11.0 3 3 3 10.0 10.0 3 4 4 1000.0 1000.0 1 5 5 1000.0 1001.0 1 6 6 1000.0 999.0 1 7 7 999.0 999.0 1 8 8 999.0 1000.0 1 9 9 1000.0 1000.0 1
Create KNN instance and call fit:
>>> knn = KNN(connection_context, n_neighbors=3, voting_type='majority', ... thread_ratio=0.1, stat_info=False) >>> knn.fit(df, 'ID', features=['X1', 'X2'], label='TYPE') >>> pred_df = connection_context.table("PAL_KNN_CLASSDATA_TBL")
Call predict:
>>> res, stat = knn.predict(pred_df, "ID") >>> res.collect() ID TYPE 0 0 3 1 1 3 2 2 3 3 3 1 4 4 1 5 5 1 6 6 1 7 7 1
Methods
fit
(data, key[, features, label])Fit the model when given training set. predict
(data, key[, features])Predict the class labels for the provided data score
(data, key[, features, label])Return a scalar accuracy value after comparing the predicted and original label. -
fit
(data, key, features=None, label=None)¶ Fit the model when given training set.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
-
predict
(data, key, features=None)¶ Predict the class labels for the provided data
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - result_df : DataFrame
- Predicted result, structured as follows:
- ID column, with same name and type as data’s ID column.
- Label column, with same name and type as training data’s label column.
- nearest_neighbors_df : DataFrame
The distance between each point in data and its k nearest neighbors in the training set. Only returned if stat_info is True. Structured as follows:
-
score
(data, key, features=None, label=None)¶ Return a scalar accuracy value after comparing the predicted and original label.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
Returns: - accuracy : float
Scalar accuracy value after comparing the predicted label and original label.
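As a usage sketch, score() can be evaluated directly on the labeled training DataFrame df from the example above (evaluating on held-out data would normally be preferable):

>>> accuracy = knn.score(df, key='ID', features=['X1', 'X2'], label='TYPE')
>>> accuracy  # proportion of rows whose predicted TYPE matches the original label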
hana_ml.algorithms.pal.neural_network¶
This module contains PAL wrappers for Multi-layer Perceptron algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.neural_network.
MLPClassifier
(conn_context, activation, output_activation, hidden_layer_size, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.neural_network._MLPBase
Multi-layer perceptron (MLP) Classifier.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- activation : str
Activation function for the hidden layer:
- ‘tanh’
- ‘linear’
- ‘sigmoid_asymmetric’
- ‘sigmoid_symmetric’
- ‘gaussian_asymmetric’
- ‘gaussian_symmetric’
- ‘elliot_asymmetric’
- ‘elliot_symmetric’
- ‘sin_asymmetric’
- ‘sin_symmetric’
- ‘cos_asymmetric’
- ‘cos_symmetric’
- ‘relu’
- output_activation : str
Activation function for the output layer:
- ‘tanh’
- ‘linear’
- ‘sigmoid_asymmetric’
- ‘sigmoid_symmetric’
- ‘gaussian_asymmetric’
- ‘gaussian_symmetric’
- ‘elliot_asymmetric’
- ‘elliot_symmetric’
- ‘sin_asymmetric’
- ‘sin_symmetric’
- ‘cos_asymmetric’
- ‘cos_symmetric’
- ‘relu’
- hidden_layer_size : tuple of int
Size of each hidden layer.
- max_iter : int, optional
Maximum number of iterations. Defaults to 100.
- training_style : {‘batch’, ‘stochastic’}, optional
Specifies the training style. Defaults to stochastic.
- learning_rate : float, optional
Specifies the learning rate. Only valid when training_style is stochastic.
- momentum : float, optional
Specifies the momentum for gradient descent update. Only valid when training_style is stochastic.
- batch_size : int, optional
Specifies the size of mini batch. Only valid when training_style is stochastic. Defaults to 1.
- normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional
Defaults to no (no normalization).
- weight_init :str, optional
Specifies the weight initial value.
- ‘all-zeros’
- ‘normal’
- ‘uniform’
- ‘variance-scale-normal’
- ‘variance-scale-uniform’
Defaults to all-zeros.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
Examples
Training data:
>>> df = connection_context.table("PAL_TRAIN_MLP_REG_DATA_TBL") >>> df.collect() V000 V001 V002 V003 LABEL 0 1 1.71 AC 0 AA 1 10 1.78 CA 5 AB 2 17 2.36 AA 6 AA 3 12 3.15 AA 2 C 4 7 1.05 CA 3 AB 5 6 1.50 CA 2 AB 6 9 1.97 CA 6 C 7 5 1.26 AA 1 AA 8 12 2.13 AC 4 C 9 18 1.87 AC 6 AA
Training the model:
>>> mlpc = MLPClassifier(connection_context, hidden_layer_size=(10,10), ... activation='TANH', output_activation='TANH', ... learning_rate=0.001, momentum=0.0001, ... training_style='stochastic',max_iter=100, ... normalization='z-transform', weight_init='normal', ... thread_ratio=0.3, categorical_variable='V003') >>> mlpc.fit(df)
Training results may differ from those shown below due to model randomness.
>>> mlpc.model_.collect() ROW_INDEX MODEL_CONTENT 0 1 {"CurrentVersion":"1.0","DataDictionary":[{"da... 1 2 t":0.2700182926188939},{"from":13,"weight":0.0... 2 3 ht":0.2414416413305134},{"from":21,"weight":0.... >>> mlpc.train_log_.collect() ITERATION ERROR 0 1 1.080261 1 2 1.008358 2 3 0.947069 3 4 0.894585 4 5 0.849411 5 6 0.810309 6 7 0.776256 7 8 0.746413 8 9 0.720093 9 10 0.696737 10 11 0.675886 11 12 0.657166 12 13 0.640270 13 14 0.624943 14 15 0.609432 15 16 0.595204 16 17 0.582101 17 18 0.569990 18 19 0.558757 19 20 0.548305 20 21 0.538553 21 22 0.529429 22 23 0.521457 23 24 0.513893 24 25 0.506704 25 26 0.499861 26 27 0.493338 27 28 0.487111 28 29 0.481159 29 30 0.475462 .. ... ... 70 71 0.349684 71 72 0.347798 72 73 0.345954 73 74 0.344071 74 75 0.342232 75 76 0.340597 76 77 0.338837 77 78 0.337236 78 79 0.335749 79 80 0.334296 80 81 0.332759 81 82 0.331255 82 83 0.329810 83 84 0.328367 84 85 0.326952 85 86 0.325566 86 87 0.324232 87 88 0.322899 88 89 0.321593 89 90 0.320242 90 91 0.318985 91 92 0.317840 92 93 0.316630 93 94 0.315376 94 95 0.314210 95 96 0.313066 96 97 0.312021 97 98 0.310916 98 99 0.309770 99 100 0.308704
[100 rows x 2 columns]
Prediction:
>>> pred_df = connection_context.table("PAL_PREDICT_MLP_CLS_DATA_TBL") >>> res, stat = mlpc.predict(pred_df, 'ID')
Prediction results may differ from those shown below due to model randomness.
>>> res.collect() ID TARGET VALUE 0 1 C 0.472751 1 2 C 0.417681 2 3 C 0.543967 >>> stat.collect() ID CLASS SOFT_MAX 0 1 AA 0.371996 1 1 AB 0.155253 2 1 C 0.472751 3 2 AA 0.357822 4 2 AB 0.224496 5 2 C 0.417681 6 3 AA 0.349813 7 3 AB 0.106220 8 3 C 0.543967
Attributes: - model_ : DataFrame
Model content.
- train_log_ : DataFrame
Provides mean squared error between predicted values and target values for each iteration.
Methods
fit
(data[, key, features, label])Fit the model when given training dataset. predict
(data, key[, features])Predict using the multi-layer perceptron model. score
(data, key[, features, label])Returns the accuracy on the given test data and labels. -
fit
(data, key=None, features=None, label=None)¶ Fit the model when given training dataset.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
-
predict
(data, key, features=None)¶ Predict using the multi-layer perceptron model.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - result_df : DataFrame
- Predicted classes, structured as follows:
- ID column, with the same name and type as data’s ID column.
- TARGET, type NVARCHAR, predicted class name.
- VALUE, type DOUBLE, softmax value for the predicted class.
- softmax_df : DataFrame
- Softmax values for all classes, structured as follows:
- ID column, with the same name and type as data’s ID column.
- CLASS, type NVARCHAR, class name.
- VALUE, type DOUBLE, softmax value for that class.
-
score
(data, key, features=None, label=None)¶ Returns the accuracy on the given test data and labels.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
Returns: - accuracy : float
Scalar value of accuracy after comparing the predicted result and original label.
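For illustration, a minimal sketch of score() (not part of the original example set), assuming a hypothetical labeled DataFrame df_test with an ID column, the feature columns V000-V003 and a LABEL column:

>>> accuracy = mlpc.score(df_test, key='ID', label='LABEL')
>>> accuracy  # classification accuracy on df_test, between 0 and 1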
-
class
hana_ml.algorithms.pal.neural_network.
MLPRegressor
(conn_context, activation, output_activation, hidden_layer_size, max_iter=None, training_style='stochastic', learning_rate=None, momentum=None, batch_size=None, normalization=None, weight_init=None, categorical_variable=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.neural_network._MLPBase
Multi-layer perceptron (MLP) Regressor.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- activation : str
Activation function for the hidden layer:
- ‘tanh’
- ‘linear’
- ‘sigmoid_asymmetric’
- ‘sigmoid_symmetric’
- ‘gaussian_asymmetric’
- ‘gaussian_symmetric’
- ‘elliot_asymmetric’
- ‘elliot_symmetric’
- ‘sin_asymmetric’
- ‘sin_symmetric’
- ‘cos_asymmetric’
- ‘cos_symmetric’
- ‘relu’
- output_activation : str
Activation function for the output layer:
- ‘tanh’
- ‘linear’
- ‘sigmoid_asymmetric’
- ‘sigmoid_symmetric’
- ‘gaussian_asymmetric’
- ‘gaussian_symmetric’
- ‘elliot_asymmetric’
- ‘elliot_symmetric’
- ‘sin_asymmetric’
- ‘sin_symmetric’
- ‘cos_asymmetric’
- ‘cos_symmetric’
- ‘relu’
- hidden_layer_size : tuple of int
Size of each hidden layer.
- max_iter : int, optional
Maximum number of iterations. Defaults to 100.
- training_style : {‘batch’, ‘stochastic’}, optional
Specifies the training style. Defaults to stochastic.
- learning_rate : float, optional
Specifies the learning rate. Only valid when training_style is stochastic.
- momentum : float, optional
Specifies the momentum for gradient descent update. Only valid when training_style is stochastic.
- batch_size : int, optional
Specifies the size of mini batch. Only valid when training_style is stochastic. Defaults to 1.
- normalization : {‘no’, ‘z-transform’, ‘scalar’}, optional
Defaults to no (no normalization).
- weight_init : str, optional
Specifies the weight initial value.
- ‘all-zeros’
- ‘normal’
- ‘uniform’
- ‘variance-scale-normal’
- ‘variance-scale-uniform’
Defaults to all-zeros.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
Examples
Training data:
>>> df = connection_context.table("PAL_TRAIN_MLP_REG_DATA_TBL") >>> df.collect() V000 V001 V002 V003 T001 T002 T003 0 1 1.71 AC 0 12.7 2.8 3.06 1 10 1.78 CA 5 12.1 8.0 2.65 2 17 2.36 AA 6 10.1 2.8 3.24 3 12 3.15 AA 2 28.1 5.6 2.24 4 7 1.05 CA 3 19.8 7.1 1.98 5 6 1.50 CA 2 23.2 4.9 2.12 6 9 1.97 CA 6 24.5 4.2 1.05 7 5 1.26 AA 1 13.6 5.1 2.78 8 12 2.13 AC 4 13.2 1.9 1.34 9 18 1.87 AC 6 25.5 3.6 2.14
Training the model:
>>> mlpr = MLPRegressor(connection_context, hidden_layer_size=(10,5), ... activation='SIN_ASYMMETRIC', ... output_activation='SIN_ASYMMETRIC', ... learning_rate=0.001, momentum=0.00001, ... training_style='batch', ... max_iter=10000, normalization='z-transform', ... weight_init='normal', thread_ratio=0.3) >>> mlpr.fit(df, label=['T001', 'T002', 'T003'])
Training results may differ from those shown below due to model randomness.
>>> mlpr.model_.collect() ROW_INDEX MODEL_CONTENT 0 1 {"CurrentVersion":"1.0","DataDictionary":[{"da... 1 2 3782583596893},{"from":10,"weight":-0.16532599... >>> mlpr.train_log_.collect() ITERATION ERROR 0 1 34.525655 1 2 82.656301 2 3 67.289241 3 4 162.768062 4 5 38.988242 5 6 142.239468 6 7 34.467742 7 8 31.050946 8 9 30.863581 9 10 30.078204 10 11 26.671436 11 12 28.078312 12 13 27.243226 13 14 26.916686 14 15 26.782915 15 16 26.724266 16 17 26.697108 17 18 26.684084 18 19 26.677713 19 20 26.674563 20 21 26.672997 21 22 26.672216 22 23 26.671826 23 24 26.671631 24 25 26.671533 25 26 26.671485 26 27 26.671460 27 28 26.671448 28 29 26.671442 29 30 26.671439 .. ... ... 705 706 11.891081 706 707 11.891081 707 708 11.891081 708 709 11.891081 709 710 11.891081 710 711 11.891081 711 712 11.891081 712 713 11.891081 713 714 11.891081 714 715 11.891081 715 716 11.891081 716 717 11.891081 717 718 11.891081 718 719 11.891081 719 720 11.891081 720 721 11.891081 721 722 11.891081 722 723 11.891081 723 724 11.891081 724 725 11.891081 725 726 11.891081 726 727 11.891081 727 728 11.891081 728 729 11.891081 729 730 11.891081 730 731 11.891081 731 732 11.891081 732 733 11.891081 733 734 11.891081 734 735 11.891081
[735 rows x 2 columns]
>>> pred_df = connection_context.table("PAL_PREDICT_MLP_REG_DATA_TBL") >>> pred_df.collect() ID V000 V001 V002 V003 0 1 1 1.71 AC 0 1 2 10 1.78 CA 5 2 3 17 2.36 AA 6
Prediction:
>>> res = mlpr.predict(pred_df, 'ID')
Results may differ from those shown below due to model randomness.
>>> res.collect() ID TARGET VALUE 0 1 T001 12.700012 1 1 T002 2.799133 2 1 T003 2.190000 3 2 T001 12.099740 4 2 T002 6.100000 5 2 T003 2.190000 6 3 T001 10.099961 7 3 T002 2.799659 8 3 T003 2.190000
Attributes: - model_ : DataFrame
Model content.
- train_log_ : DataFrame
Provides mean squared error between predicted values and target values for each iteration.
Methods
fit
(data[, key, features, label])Fit the model when given training dataset. predict
(data, key[, features])Predict using the multi-layer perceptron model. score
(data, key[, features, label])Returns the coefficient of determination R^2 of the prediction. -
fit
(data, key=None, features=None, label=None)¶ Fit the model when given training dataset.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
- label : str or list of str, optional
Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.
-
predict
(data, key, features=None)¶ Predict using the multi-layer perceptron model.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Predicted results, structured as follows:
- ID column, with the same name and type as data’s ID column.
- TARGET, type NVARCHAR, target name.
- VALUE, type DOUBLE, regression value.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R^2 of the prediction.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
- label : str or list of str, optional
Name of the label column, or list of names of multiple label columns. If label is not provided, it defaults to the last column.
Returns: - accuracy : float
Returns the coefficient of determination R^2 of the prediction.
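For illustration, a minimal sketch of score() (not part of the original example set), assuming a hypothetical labeled DataFrame df_test with an ID column, the feature columns V000-V003 and the target columns T001-T003:

>>> r2 = mlpr.score(df_test, key='ID', label=['T001', 'T002', 'T003'])
>>> r2  # coefficient of determination of the predictions on df_test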
hana_ml.algorithms.pal.preprocessing¶
This module contains PAL wrappers for preprocessing algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.preprocessing.
FeatureNormalizer
(conn_context, method, z_score_method=None, new_max=None, new_min=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Normalize a dataframe.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- method : {‘min-max’, ‘z-score’, ‘decimal’}
- Scaling methods:
- ‘min-max’: Min-max normalization
- ‘z-score’: Z-Score normalization
- ‘decimal’: Decimal scaling normalization
- z_score_method : {‘mean-standard’, ‘mean-mean’, ‘median-median’}, optional
- Only valid when method is ‘z-score’.
- ‘mean-standard’: Mean-Standard deviation
- ‘mean-mean’: Mean-Mean deviation
- ‘median-median’: Median-Median absolute deviation
- new_max : float, optional
The new maximum value for min-max normalization. Only valid when method is ‘min-max’.
- new_min : float, optional
The new minimum value for min-max normalization. Only valid when method is ‘min-max’.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Does not affect transform(). Defaults to 0.
Examples
Input dataframe for training:
>>> df1.head(4).collect() ID X1 X2 0 0 6.0 9.0 1 1 12.1 8.3 2 2 13.5 15.3 3 3 15.4 18.7
Creating FeatureNormalizer instance:
>>> fn = FeatureNormalizer(cc, method="min-max", new_max=1.0, new_min=0.0)
Performing fit() on given dataframe:
>>> fn.fit(df1, key='ID') >>> fn.result_.head(4).collect() ID X1 X2 0 0 0.000000 0.033175 1 1 0.186544 0.000000 2 2 0.229358 0.331754 3 3 0.287462 0.492891
Input dataframe for transforming:
>>> df2.collect() ID S_X1 S_X2 0 0 6.0 9.0 1 1 6.0 7.0 2 2 4.0 4.0 3 3 1.0 2.0 4 4 9.0 -2.0 5 5 4.0 5.0
Performing transform() on given dataframe:
>>> result = fn.transform(df2, key='ID') >>> result.collect() ID S_X1 S_X2 0 0 0.000000 0.033175 1 1 0.000000 -0.061611 2 2 -0.061162 -0.203791 3 3 -0.152905 -0.298578 4 4 0.091743 -0.488152 5 5 -0.061162 -0.156398
Attributes: - result_ : DataFrame
Scaled dataset from fit() and fit_transform().
- model_ : DataFrame
Trained model content.
Methods
fit
(data, key[, features])Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling. fit_transform
(data, key[, features])Fit with the dataset and return the results. transform
(data, key[, features])Scales data based on the previous scaling model. -
fit
(data, key, features=None)¶ Normalize input data and generate a scaling model using one of the three scaling methods: min-max normalization, z-score normalization and normalization by decimal scaling.
Parameters: - data : DataFrame
DataFrame to be normalized.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
-
fit_transform
(data, key, features=None)¶ Fit with the dataset and return the results.
Parameters: - data : DataFrame
DataFrame to be normalized.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
Returns: - DataFrame
Normalized result, with the same structure as data.
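As a usage sketch, fit_transform() is equivalent to calling fit() and then reading result_. Assuming the same df1 and parameters as in the class example above:

>>> fn = FeatureNormalizer(cc, method="min-max", new_max=1.0, new_min=0.0)
>>> normalized = fn.fit_transform(df1, key='ID')
>>> normalized.head(4).collect()  # same rows as fn.result_ in the fit() example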
-
transform
(data, key, features=None)¶ Scales data based on the previous scaling model.
Parameters: - data : DataFrame
DataFrame to be normalized.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
Returns: - DataFrame
Normalized result, with the same structure as data.
-
class
hana_ml.algorithms.pal.preprocessing.
KBinsDiscretizer
(conn_context, strategy, smoothing, n_bins=None, bin_size=None, n_sd=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Bin continuous data into number of intervals and perform local smoothing.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- strategy : {‘uniform_number’, ‘uniform_size’, ‘quantile’, ‘sd’}
- Binning methods:
- ‘uniform_number’: Equal widths based on the number of bins.
- ‘uniform_size’: Equal widths based on the bin size.
- ‘quantile’: Equal number of records per bin.
- ‘sd’: Bins are divided based on the distance from the mean. Most bins are one standard deviation wide, except that the center bin contains all values within one standard deviation from the mean, and the leftmost and rightmost bins contain all values more than n_sd standard deviations from the mean in the corresponding directions.
- smoothing : {‘means’, ‘medians’, ‘boundaries’}
- Smoothing methods:
- ‘means’: Each value within a bin is replaced by the average of all the values belonging to the same bin.
- ‘medians’: Each value in a bin is replaced by the median of all the values belonging to the same bin.
- ‘boundaries’: The minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by its closest boundary value. When a value is equally distant from both boundaries, it is replaced by the front (lower) boundary value.
Values used for smoothing are not re-calculated during transform().
- n_bins : int, optional
The number of bins. Only valid when strategy is ‘uniform_number’ or ‘quantile’. Defaults to 2.
- bin_size : int, optional
The interval width of each bin. Only valid when strategy is ‘uniform_size’. Defaults to 10.
- n_sd : int, optional
The leftmost bin contains all values located further than n_sd standard deviations lower than the mean, and the rightmost bin contains all values located further than n_sd standard deviations above the mean. Only valid when strategy is ‘sd’. Defaults to 1.
Examples
Input dataframe for fitting:
>>> df1.collect() ID DATA 0 0 6.0 1 1 12.0 2 2 13.0 3 3 15.0 4 4 10.0 5 5 23.0 6 6 24.0 7 7 30.0 8 8 32.0 9 9 25.0 10 10 38.0
Creating KBinsDiscretizer instance:
>>> binning = KBinsDiscretizer(cc, strategy='uniform_size', ... smoothing='means', ... bin_size=10)
Performing fit() on the given dataframe:
>>> binning.fit(df1, key='ID') >>> binning.result_.collect() ID BIN_INDEX DATA 0 0 1 8.000000 1 1 2 13.333333 2 2 2 13.333333 3 3 2 13.333333 4 4 1 8.000000 5 5 3 25.500000 6 6 3 25.500000 7 7 3 25.500000 8 8 4 35.000000 9 9 3 25.500000 10 10 4 35.000000
Input dataframe for transforming:
>>> df2.collect() ID DATA 0 0 6.0 1 1 67.0 2 2 4.0 3 3 12.0 4 4 -2.0 5 5 40.0
Performing transform() on the given dataframe:
>>> result = binning.transform(df2, key='ID') >>> result.collect() ID BIN_INDEX DATA 0 0 1 8.000000 1 1 -1 67.000000 2 2 1 8.000000 3 3 2 13.333333 4 4 1 8.000000 5 5 4 35.000000
Attributes: - result_ : DataFrame
Binned dataset from fit() and fit_transform().
- model_ : DataFrame
Binning model content.
Methods
fit
(data, key[, features])Bin input data into number of intervals and smooth. fit_transform
(data, key[, features])Fit with the dataset and return the results. transform
(data, key[, features])Bin data based on the previous binning model. -
fit
(data, key, features=None)¶ Bin input data into number of intervals and smooth.
Parameters: - data : DataFrame
DataFrame to be discretized.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
-
fit_transform
(data, key, features=None)¶ Fit with the dataset and return the results.
Parameters: - data : DataFrame
DataFrame to be binned.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. Since the underlying PAL_BINNING only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
Returns: - DataFrame
- Binned result, structured as follows:
- DATA_ID column, with same name and type as data’s ID column.
- BIN_INDEX, type INTEGER, assigned bin index.
- BINNING_DATA column, smoothed value, with same name and type as data’s feature column.
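As a usage sketch, fit_transform() fits the binning model and returns the binned training data in one step. Assuming the same df1 and parameters as in the class example above:

>>> binning = KBinsDiscretizer(cc, strategy='uniform_size',
...                            smoothing='means', bin_size=10)
>>> binned = binning.fit_transform(df1, key='ID')
>>> binned.collect()  # same ID/BIN_INDEX/DATA content as binning.result_ after fit()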
-
transform
(data, key, features=None)¶ Bin data based on the previous binning model.
Parameters: - data : DataFrame
DataFrame to be binned.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. Since the underlying PAL_BINNING_ASSIGNMENT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
Returns: - DataFrame
- Binned result, structured as follows:
- DATA_ID column, with same name and type as data’s ID column.
- BIN_INDEX, type INTEGER, assigned bin index.
- BINNING_DATA column, smoothed value, with same name and type as data’s feature column.
hana_ml.algorithms.pal.regression¶
This module contains wrappers for PAL regression algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.regression.
GLM
(conn_context, family=None, link=None, solver=None, handle_missing_fit=None, quasilikelihood=None, max_iter=None, tol=None, significance_level=None, output_fitted=None, alpha=None, num_lambda=None, lambda_min_ratio=None, categorical_variable=None, ordering=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
Regression by a generalized linear model, based on PAL_GLM. Also supports ordinal regression.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- family : str, optional
The kind of distribution the dependent variable outcomes are assumed to be drawn from. Must be one of the following:
- ‘gaussian’
- ‘normal’ (synonym of ‘gaussian’)
- ‘poisson’
- ‘binomial’
- ‘gamma’
- ‘inversegaussian’
- ‘negativebinomial’
- ‘ordinal’ (for ordinal regression)
Defaults to ‘gaussian’.
- link : str, optional
GLM link function. Determines the relationship between the linear predictor and the predicted response. Default and allowed values depend on family. ‘inverse’ is accepted as a synonym of ‘reciprocal’.
The default and allowed link values for each family are:
- gaussian: default identity; allowed: identity, log, reciprocal
- poisson: default log; allowed: identity, log
- binomial: default logit; allowed: logit, probit, comploglog, log
- gamma: default reciprocal; allowed: identity, reciprocal, log
- inversegaussian: default inversesquare; allowed: inversesquare, identity, reciprocal, log
- negativebinomial: default log; allowed: identity, log, sqrt
- ordinal: default logit; allowed: logit, probit, comploglog
- solver : {‘irls’, ‘nr’, ‘cd’}, optional
Optimization algorithm to use.
- ‘irls’: Iteratively re-weighted least squares.
- ‘nr’: Newton-Raphson.
- ‘cd’: Coordinate descent. (Picking coordinate descent activates elastic net regularization.)
Defaults to ‘irls’, except when family is ‘ordinal’. Ordinal regression requires (and defaults to) ‘nr’, and Newton-Raphson is not supported for other values of family.
- handle_missing_fit : {‘skip’, ‘abort’, ‘fill_zero’}, optional
How to handle data rows with missing independent variable values during fitting.
- ‘skip’: Don’t use those rows for fitting.
- ‘abort’: Throw an error if missing independent variable values are found.
- ‘fill_zero’: Replace missing values with 0.
Defaults to ‘skip’.
- quasilikelihood : bool, optional
If True, enables the use of quasi-likelihood to estimate overdispersion. Defaults to False.
- max_iter : int, optional
Maximum number of optimization iterations. Defaults to 100 for IRLS and Newton-Raphson. Defaults to 100000 for coordinate descent.
- tol : float, optional
Stopping condition for optimization. Defaults to 1e-8 for IRLS, 1e-6 for Newton-Raphson, and 1e-7 for coordinate descent.
- significance_level : float, optional
Significance level for confidence intervals and prediction intervals. Defaults to 0.05.
- output_fitted : bool, optional
If True, create the fitted_ DataFrame of fitted response values for training data in fit.
- alpha : float, optional
Elastic net mixing parameter. Only accepted when using coordinate descent. Should be between 0 and 1 inclusive. Defaults to 1.0.
- num_lambda : int, optional
The number of lambda values. Only accepted when using coordinate descent. Defaults to 100.
- lambda_min_ratio : float, optional
The smallest value of lambda, as a fraction of the maximum lambda, where lambda_max is the smallest value for which all coefficients are zero. Only accepted when using coordinate descent. Defaults to 0.01 when the number of observations is smaller than the number of covariates, and 0.0001 otherwise.
- categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data. Other INTEGER columns will be treated as continuous.
- ordering : list of str or list of int, optional
Specifies the order of categories for ordinal regression. The default is numeric order for ints and alphabetical order for strings.
Examples
Training data:
>>> df.collect() ID Y X 0 1 0 -1 1 2 0 -1 2 3 1 0 3 4 1 0 4 5 1 0 5 6 1 0 6 7 2 1 7 8 2 1 8 9 2 1
Fitting a GLM on that data:
>>> glm = GLM(cc, solver='irls', family='poisson', link='log') >>> glm.fit(df, key='ID', label='Y')
Performing prediction:
>>> df2.collect() ID X 0 1 -1 1 2 0 2 3 1 3 4 2 >>> glm.predict(df2, key='ID')[['ID', 'PREDICTION']].collect() ID PREDICTION 0 1 0.25543735346197155 1 2 0.744562646538029 2 3 2.1702915689746476 3 4 6.32608352871737
Attributes: - statistics_ : DataFrame
Training statistics and model information other than the coefficients and covariance matrix.
- coef_ : DataFrame
Model coefficients.
- covmat_ : DataFrame
Covariance matrix. Set to None for coordinate descent.
- fitted_ : DataFrame
Predicted values for the training data. Set to None if output_fitted is False.
Methods
fit
(data[, key, features, label])Fit a generalized linear model based on training data. predict
(data, key[, features, …])Predict dependent variable values based on fitted model. score
(data, key[, features, label, …])Returns the coefficient of determination R^2 of the prediction. -
fit
(data, key=None, features=None, label=None)¶ Fit a generalized linear model based on training data.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column. Required when output_fitted is True.
- features : list of str, optional
Names of the feature columns. Defaults to all non-ID, non-label columns.
- label : str or list of str, optional
Name of the dependent variable. Defaults to the last column. (This is not the PAL default.) When family is ‘binomial’, label may be either a single column name or a list of two column names.
-
predict
(data, key, features=None, prediction_type=None, significance_level=None, handle_missing=None)¶ Predict dependent variable values based on fitted model.
Parameters: - data : DataFrame
Independent variable values to predict for.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. Defaults to all non-ID columns.
- prediction_type : {‘response’, ‘link’}, optional
Specifies whether to output predicted values of the response or the link function. Defaults to ‘response’.
- significance_level : float, optional
Significance level for confidence intervals and prediction intervals. If specified, overrides the value passed to the GLM constructor.
- handle_missing : {‘skip’, ‘fill_zero’}, optional
How to handle data rows with missing independent variable values.
- ‘skip’: Don’t perform prediction for those rows.
- ‘fill_zero’: Replace missing values with 0.
Defaults to ‘skip’.
Returns: - DataFrame
Predicted values, structured as follows. The following two columns are always populated:
- ID column, with same name and type as data’s ID column.
- PREDICTION, type NVARCHAR(100), representing predicted values.
- The following five columns are only populated for IRLS:
- SE, type DOUBLE. Standard error, or for ordinal regression, the probability that the data point belongs to the predicted category.
- CI_LOWER, type DOUBLE. Lower bound of the confidence interval.
- CI_UPPER, type DOUBLE. Upper bound of the confidence interval.
- PI_LOWER, type DOUBLE. Lower bound of the prediction interval.
- PI_UPPER, type DOUBLE. Upper bound of the prediction interval.
-
score
(data, key, features=None, label=None, prediction_type=None, handle_missing=None)¶ Returns the coefficient of determination R^2 of the prediction.
Not applicable for ordinal regression.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. Defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column. (This is not the PAL default.) Cannot be two columns, even for family=’binomial’.
- prediction_type : {‘response’, ‘link’}, optional
Specifies whether to predict the value of the response or the link function. The contents of the label column should match this choice. Defaults to ‘response’.
- handle_missing : {‘skip’, ‘fill_zero’}, optional
How to handle data rows with missing independent variable values.
- ‘skip’: Don’t perform prediction for those rows. Those rows will be left out of the R^2 computation.
- ‘fill_zero’: Replace missing values with 0.
Defaults to ‘skip’.
Returns: - accuracy : float
The coefficient of determination R^2 of the prediction on the given data.
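For illustration, a minimal sketch of score() (not part of the original example set), assuming a hypothetical labeled DataFrame df3 with the same ID, Y and X columns as the training data:

>>> r2 = glm.score(df3, key='ID', label='Y')
>>> r2  # R^2 of the predicted response values against Y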
-
class
hana_ml.algorithms.pal.regression.
PolynomialRegression
(conn_context, degree, decomposition=None, adjusted_r2=None, pmml_export=None, thread_ratio=None)¶ Bases:
hana_ml.algorithms.pal.pal_base.PALBase
A univariate polynomial regression model, based on PAL_POLYNOMIAL_REGRESSION.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- degree : int
Degree of the polynomial model.
- decomposition : {‘LU’, ‘SVD’}, optional
- Matrix factorization type to use. Case-insensitive.
- ‘LU’: LU decomposition.
- ‘SVD’: singular value decomposition.
Defaults to LU decomposition.
- adjusted_r2 : boolean, optional
If true, include the adjusted R^2 value in the statistics table. Defaults to False.
- pmml_export : {‘no’, ‘single-row’, ‘multi-row’}, optional
Controls whether to output a PMML representation of the model, and how to format the PMML. Case-insensitive.
- ‘no’ or not provided: No PMML model.
- ‘single-row’: Exports a PMML model in a maximum of one row. Fails if the model doesn’t fit in one row.
- ‘multi-row’: Exports a PMML model, splitting it across multiple rows if it doesn’t fit in one.
Prediction does not require a PMML model.
- thread_ratio : float, optional
Controls the proportion of available threads to use for prediction. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Does not affect fitting. Defaults to 0.
Examples
Training data (based on y = x^3 - 2x^2 + 3x + 5, with noise):
>>> df.collect() ID X Y 0 1 0.0 5.048 1 2 1.0 7.045 2 3 2.0 11.003 3 4 3.0 23.072 4 5 4.0 49.041
Training the model:
>>> pr = PolynomialRegression(cc, degree=3) >>> pr.fit(df, key='ID')
Prediction:
>>> df2.collect() ID X 0 1 0.5 1 2 1.5 2 3 2.5 3 4 3.5 >>> pr.predict(df2, key='ID').collect() ID VALUE 0 1 6.157063 1 2 8.401269 2 3 15.668581 3 4 33.928501
Ideal output:
>>> df2.select('ID', ('POWER(X, 3)-2*POWER(X, 2)+3*x+5', 'Y')).collect() ID Y 0 1 6.125 1 2 8.375 2 3 15.625 3 4 33.875
Attributes: - coefficients_ : DataFrame
Fitted regression coefficients.
- pmml_ : DataFrame
PMML model. Set to None if no PMML model was requested.
- fitted_ : DataFrame
Predicted dependent variable values for training data. Set to None if the training data has no row IDs.
- statistics_ : DataFrame
Regression-related statistics, such as mean squared error.
Methods
fit
(data[, key, features, label])Fit regression model based on training data. predict
(data, key[, features])Predict dependent variable values based on fitted model. score
(data, key[, features, label])Returns the coefficient of determination R^2 of the prediction. -
fit
(data, key=None, features=None, label=None)¶ Fit regression model based on training data.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.
- label : str, optional
Name of the dependent variable. Defaults to the last column. (This is not the PAL default.)
-
predict
(data, key, features=None)¶ Predict dependent variable values based on fitted model.
Parameters: - data : DataFrame
Independent variable values used for prediction.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID column, and features defaults to that column.
Returns: - DataFrame
- Predicted values, structured as follows:
- ID column, with same name and type as data’s ID column.
- VALUE, type DOUBLE, representing predicted values.
Notes
predict() will pass the pmml_ table to PAL as the model representation if there is a pmml_ table, or the coefficients_ table otherwise.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R^2 of the prediction.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. Since the underlying PAL_POLYNOMIAL_REGRESSION_PREDICT only supports one feature, this list can only contain one element. If features is not provided, data must have exactly 1 non-ID, non-label column, and features defaults to that column.
- label : str, optional
Name of the dependent variable. Defaults to the last column. (This is not the PAL default.)
Returns: - accuracy : float
The coefficient of determination R^2 of the prediction on the given data.
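The returned value is the ordinary coefficient of determination, 1 - SS_res/SS_tot. As a minimal client-side sketch (not part of the hana_ml API), assuming the pr model and the training DataFrame df from the example above, the same quantity can be recomputed from predict() output:
>>> actual = df.collect().set_index('ID')['Y']
>>> predicted = pr.predict(df.select('ID', 'X'), key='ID').collect().set_index('ID')['VALUE']
>>> ss_res = ((actual - predicted) ** 2).sum()      # residual sum of squares
>>> ss_tot = ((actual - actual.mean()) ** 2).sum()  # total sum of squares
>>> 1 - ss_res / ss_tot                             # expected to agree with pr.score(df, key='ID')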
hana_ml.algorithms.pal.stats¶
This module contains PAL wrappers for statistics algorithms.
The following functions are available:
-
hana_ml.algorithms.pal.stats.
chi_squared_goodness_of_fit
(conn_context, data, key, observed_data=None, expected_freq=None)¶ Perform the chi-squared goodness-of-fit test to tell whether or not an observed distribution differs from an expected distribution.
Parameters: - conn_context : ConnectionContext
Database connection object.
- data : DataFrame
Input data.
- key : str
Name of the ID column.
- observed_data : str, optional
Name of column for counts of actual observations belonging to each category. If not given, the input dataframe must only have three columns. The first of the non-ID columns will be observed_data.
- expected_freq : str, optional
Name of the expected frequency column. If not given, the input dataframe must only have three columns. The second of the non-ID columns will be expected_freq.
Returns: - count_comparison_df : DataFrame
Comparison between the actual counts and the expected counts, structured as follows:
- ID column, with same name and type as data’s ID column.
- Observed data column, with same name as data’s observed_data column, but always with type DOUBLE.
- EXPECTED, type DOUBLE, expected count in each category.
- RESIDUAL, type DOUBLE, the difference between the observed counts and the expected counts.
- stat_df : DataFrame
Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:
- STAT_NAME, type NVARCHAR(100), name of statistics.
- STAT_VALUE, type DOUBLE, value of statistics.
Examples
Data to test:
>>> df = cc.table('PAL_CHISQTESTFIT_DATA_TBL')
>>> df.collect()
   ID  OBSERVED    P
0   0     519.0  0.3
1   1     364.0  0.2
2   2     363.0  0.2
3   3     200.0  0.1
4   4     212.0  0.1
5   5     193.0  0.1
Perform chi_squared_goodness_of_fit:
>>> res, stat = chi_squared_goodness_of_fit(cc, df, 'ID')
>>> res.collect()
   ID  OBSERVED  EXPECTED  RESIDUAL
0   0     519.0     555.3     -36.3
1   1     364.0     370.2      -6.2
2   2     363.0     370.2      -7.2
3   3     200.0     185.1      14.9
4   4     212.0     185.1      26.9
5   5     193.0     185.1       7.9
>>> stat.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.062669
1  degree of freedom    5.000000
2            p-value    0.152815
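As an optional local sanity check (illustrative only, not part of hana_ml), the same statistic can be reproduced with SciPy, using the observed counts and the expected counts shown in the tables above:
>>> from scipy.stats import chisquare
>>> observed = [519, 364, 363, 200, 212, 193]
>>> expected = [555.3, 370.2, 370.2, 185.1, 185.1, 185.1]
>>> chi2, pval = chisquare(observed, f_exp=expected)
>>> chi2, pval   # approximately (8.062669, 0.152815), matching the STAT_VALUE column above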
-
hana_ml.algorithms.pal.stats.
chi_squared_independence
(conn_context, data, key, observed_data=None, correction=False)¶ Perform the chi-squared test of independence to tell whether observations of two variables are independent from each other.
Parameters: - conn_context : ConnectionContext
Database connection object.
- data : DataFrame
Input data.
- key : str
Name of the ID column.
- observed_data : list of str, optional
Names of the observed data columns. If not given, it defaults to all the non-ID columns.
- correction : bool, optional
If True, and the degrees of freedom is 1, apply Yates’s correction for continuity. The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value. Defaults to False.
Returns: - expected_count_df : DataFrame
- The expected count table, structured as follows:
- ID column, with same name and type as data’s ID column.
- Expected count columns, named by prepending
Expected_
to each observed_data column name, type DOUBLE. There will be as many columns here as there are observed_data columns.
- stat_df : DataFrame
Statistical outputs, including the calculated chi-squared value, degrees of freedom and p-value, structured as follows:
- STAT_NAME, type NVARCHAR(100), name of statistics.
- STAT_VALUE, type DOUBLE, value of statistics.
Examples
Data to test:
>>> df = cc.table('PAL_CHISQTESTIND_DATA_TBL')
>>> df.collect()
       ID  X1    X2  X3    X4
0    male  25  23.0  11  14.0
1  female  41  20.0  18   6.0
Perform chi-squared test of independence:
>>> res, stats = chi_squared_independence(cc, df, 'ID')
>>> res.collect()
       ID  EXPECTED_X1  EXPECTED_X2  EXPECTED_X3  EXPECTED_X4
0    male    30.493671    19.867089    13.398734     9.240506
1  female    35.506329    23.132911    15.601266    10.759494
>>> stats.collect()
           STAT_NAME  STAT_VALUE
0  Chi-squared Value    8.113152
1  degree of freedom    3.000000
2            p-value    0.043730
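For comparison, the same test can be run locally with SciPy on the raw 2x4 contingency table (an illustrative cross-check only; correction=False mirrors the default of this function):
>>> from scipy.stats import chi2_contingency
>>> table = [[25, 23, 11, 14],
...          [41, 20, 18, 6]]
>>> chi2, pval, dof, expected = chi2_contingency(table, correction=False)
>>> chi2, dof, pval   # approximately (8.113152, 3, 0.043730)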
-
hana_ml.algorithms.pal.stats.
covariance_matrix
(conn_context, data, cols=None)¶ Computes the covariance matrix.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data : DataFrame
Input data.
- cols : list of str, optional
List of column names to analyze. If ‘cols’ is not provided, it defaults to all columns.
Returns: - covariance_matrix : DataFrame
- Covariance between any two data samples (columns).
- ID, type NVARCHAR. The values of this column are the column names from cols.
- Covariance columns, type DOUBLE, named after the columns in cols. The covariance between variables X and Y is in column X, in the row with ID value Y.
Examples
Dataset to be analyzed:
>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8
Compute the covariance matrix:
>>> result = covariance_matrix(conn, df)
Outputs:
>>> result.collect()
  ID          X           Y
0  X  31.866667   44.473333
1  Y  44.473333  176.677667
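The entries are the usual sample covariances (normalized by n-1). As an illustrative local cross-check (not part of hana_ml), numpy.cov reproduces the same matrix from the two columns shown above:
>>> import numpy as np
>>> x = [1, 5, 3, 10, -4, 11]
>>> y = [2.4, 3.5, 8.9, -1.4, -3.5, 32.8]
>>> np.cov(x, y)   # 2x2 matrix, approximately [[31.867, 44.473], [44.473, 176.678]]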
-
hana_ml.algorithms.pal.stats.
f_oneway
(conn_context, data, group=None, sample=None, multcomp_method=None, significance_level=None)¶ Performs a 1-way ANOVA.
The purpose of one-way ANOVA is to determine whether there is any statistically significant difference between the means of three or more independent groups.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data : DataFrame
Input data.
- group : str, optional
Name of the group column. If group is not provided, defaults to the first column.
- sample : str, optional
Name of the sample measurement column. If sample is not provided, data must have exactly 1 non-group column and sample defaults to that column.
- multcomp_method : str, optional
Method used to perform multiple comparison tests. Should be one of the following:
- ‘tukey-kramer’
- ‘bonferroni’
- ‘dunn-sidak’
- ‘scheffe’
- ‘fisher-lsd’
Defaults to tukey-kramer.
- significance_level : float, optional
The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1. Defaults to 0.05.
Returns: - statistics_df : DataFrame
- Statistics for each group, structured as follows:
- GROUP, type NVARCHAR(256), group name.
- VALID_SAMPLES, type INTEGER, number of valid samples.
- MEAN, type DOUBLE, group mean.
- SD, type DOUBLE, group standard deviation.
- ANOVA_df : DataFrame
- Computed results for ANOVA, structured as follows:
- VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, including between groups, within groups (error) and total.
- SUM_OF_SQUARES, type DOUBLE, sum of squares.
- DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.
- MEAN_SQUARES, type DOUBLE, mean squares.
- F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.
- P_VALUE, type DOUBLE, associated p-value from the F-distribution.
- multiple_comparison_df : DataFrame
- Multiple comparison results, structured as follows:
- FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.
- SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.
- MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.
- SE, type DOUBLE, standard error computed from all data.
- P_VALUE, type DOUBLE, p-value.
- CI_LOWER, type DOUBLE, the lower limit of the confidence interval.
- CI_UPPER, type DOUBLE, the upper limit of the confidence interval.
Examples
Samples for One Way ANOVA test:
>>> df.collect() GROUP DATA 0 A 4.0 1 A 5.0 2 A 4.0 3 A 3.0 4 A 2.0 5 A 4.0 6 A 3.0 7 A 4.0 8 B 6.0 9 B 8.0 10 B 4.0 11 B 5.0 12 B 4.0 13 B 6.0 14 B 5.0 15 B 8.0 16 C 6.0 17 C 7.0 18 C 6.0 19 C 6.0 20 C 7.0 21 C 5.0
Perform one-way ANOVA test:
>>> stats, anova, mult_comp = f_oneway(conn, df,
...                                    multcomp_method='Tukey-Kramer',
...                                    significance_level=0.05)
Outputs:
>>> stats.collect() GROUP VALID_SAMPLES MEAN SD 0 A 8 3.625000 0.916125 1 B 8 5.750000 1.581139 2 C 6 6.166667 0.752773 3 Total 22 5.090909 1.600866 >>> anova.collect() VARIABILITY_SOURCE SUM_OF_SQUARES DEGREES_OF_FREEDOM MEAN_SQUARES \ 0 Group 27.609848 2.0 13.804924 1 Error 26.208333 19.0 1.379386 2 Total 53.818182 21.0 NaN F_RATIO P_VALUE 0 10.008021 0.001075 1 NaN NaN 2 NaN NaN >>> mult_comp.collect() FIRST_GROUP SECOND_GROUP MEAN_DIFFERENCE SE P_VALUE CI_LOWER \ 0 A B -2.125000 0.587236 0.004960 -3.616845 1 A C -2.541667 0.634288 0.002077 -4.153043 2 B C -0.416667 0.634288 0.790765 -2.028043 CI_UPPER 0 -0.633155 1 -0.930290 2 1.194710
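The F ratio and p-value in the ANOVA table can be cross-checked locally with SciPy (the multiple comparison table has no direct SciPy equivalent). The group values below are copied from the sample data above; the snippet is only an illustrative check, not part of hana_ml:
>>> from scipy.stats import f_oneway as scipy_f_oneway
>>> group_a = [4, 5, 4, 3, 2, 4, 3, 4]
>>> group_b = [6, 8, 4, 5, 4, 6, 5, 8]
>>> group_c = [6, 7, 6, 6, 7, 5]
>>> f_ratio, pval = scipy_f_oneway(group_a, group_b, group_c)
>>> f_ratio, pval   # approximately (10.008021, 0.001075), matching the ANOVA table above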
-
hana_ml.algorithms.pal.stats.
f_oneway_repeated
(conn_context, data, subject_id, measures=None, multcomp_method=None, significance_level=None, se_type=None)¶ Performs one-way repeated measures analysis of variance, along with Mauchly’s Test of Sphericity and post hoc multiple comparison tests.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data : DataFrame
Input data.
- subject_id : str
Name of the subject ID column. The algorithm treats each row of the data table as a different subject. Hence there should be no duplicate subject IDs in this column.
- measures : list of str, optional
Names of the groups (measures). If measures is not provided, defaults to all non-subject_id columns.
- multcomp_method : str, optional
Method used to perform multiple comparison tests. Should be one of the following:
- ‘tukey-kramer’
- ‘bonferroni’
- ‘dunn-sidak’
- ‘scheffe’
- ‘fisher-lsd’
Defaults to bonferroni.
- significance_level : float, optional
The significance level when the function calculates the confidence interval in multiple comparison tests. Values must be greater than 0 and less than 1. Defaults to 0.05.
- se_type : {‘all-data’, ‘two-group’}, optional
- Type of standard error used in multiple comparison tests.
- ‘all-data’: computes the standard error from all data. It has more power if the assumption of sphericity is true, especially with small data sets.
- ‘two-group’: computes the standard error from only the two groups being compared. It doesn’t assume sphericity.
Defaults to two-group.
Returns: - statistics_df : DataFrame
- Statistics for each group, structured as follows:
- GROUP, type NVARCHAR(256), group name.
- VALID_SAMPLES, type INTEGER, number of valid samples.
- MEAN, type DOUBLE, group mean.
- SD, type DOUBLE, group standard deviation.
- Mauchly_test_df : DataFrame
- Mauchly test results, structured as follows:
- STAT_NAME, type NVARCHAR(100), names of test result quantities.
- STAT_VALUE, type DOUBLE, values of test result quantities.
- ANOVA_df : DataFrame
- Computed results for ANOVA, structured as follows:
- VARIABILITY_SOURCE, type NVARCHAR(100), source of variability, divided into group, error and subject portions.
- SUM_OF_SQUARES, type DOUBLE, sum of squares.
- DEGREES_OF_FREEDOM, type DOUBLE, degrees of freedom.
- MEAN_SQUARES, type DOUBLE, mean squares.
- F_RATIO, type DOUBLE, calculated as mean square between groups divided by mean square of error.
- P_VALUE, type DOUBLE, associated p-value from the F-distribution.
- P_VALUE_GG, type DOUBLE, p-value of the Greenhouse-Geisser correction.
- P_VALUE_HF, type DOUBLE, p-value of Huynh-Feldt correction.
- P_VALUE_LB, type DOUBLE, p-value of lower bound correction.
- multiple_comparison_df : DataFrame
- Multiple comparison results, structured as follows:
- FIRST_GROUP, type NVARCHAR(256), the name of the first group to conduct pairwise test on.
- SECOND_GROUP, type NVARCHAR(256), the name of the second group to conduct pairwise test on.
- MEAN_DIFFERENCE, type DOUBLE, mean difference between the two groups.
- SE, type DOUBLE, standard error computed from all data or compared two groups, depending on se_type.
- P_VALUE, type DOUBLE, p-value.
- CI_LOWER, type DOUBLE, the lower limit of the confidence interval.
- CI_UPPER, type DOUBLE, the upper limit of the confidence interval.
Examples
Samples for One Way Repeated ANOVA test:
>>> df.collect() ID MEASURE1 MEASURE2 MEASURE3 MEASURE4 0 1 8.0 7.0 1.0 6.0 1 2 9.0 5.0 2.0 5.0 2 3 6.0 2.0 3.0 8.0 3 4 5.0 3.0 1.0 9.0 4 5 8.0 4.0 5.0 8.0 5 6 7.0 5.0 6.0 7.0 6 7 10.0 2.0 7.0 2.0 7 8 12.0 6.0 8.0 1.0
Perform one-way repeated measures ANOVA test:
>>> stats, mtest, anova, mult_comp = f_oneway_repeated(
...     conn,
...     df,
...     subject_id='ID',
...     multcomp_method='bonferroni',
...     significance_level=0.05,
...     se_type='two-group')
Outputs:
>>> stats.collect() GROUP VALID_SAMPLES MEAN SD 0 MEASURE1 8 8.125 2.232071 1 MEASURE2 8 4.250 1.832251 2 MEASURE3 8 4.125 2.748376 3 MEASURE4 8 5.750 2.915476 >>> mtest.collect() STAT_NAME STAT_VALUE 0 Mauchly's W 0.136248 1 Chi-Square 11.405981 2 df 5.000000 3 pValue 0.046773 4 Greenhouse-Geisser Epsilon 0.532846 5 Huynh-Feldt Epsilon 0.665764 6 Lower bound Epsilon 0.333333 >>> anova.collect() VARIABILITY_SOURCE SUM_OF_SQUARES DEGREES_OF_FREEDOM MEAN_SQUARES \ 0 Group 83.125 3.0 27.708333 1 Subject 17.375 7.0 2.482143 2 Error 153.375 21.0 7.303571 F_RATIO P_VALUE P_VALUE_GG P_VALUE_HF P_VALUE_LB 0 3.793806 0.02557 0.062584 0.048331 0.092471 1 NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN >>> mult_comp.collect() FIRST_GROUP SECOND_GROUP MEAN_DIFFERENCE SE P_VALUE CI_LOWER \ 0 MEASURE1 MEASURE2 3.875 0.811469 0.012140 0.924655 1 MEASURE1 MEASURE3 4.000 0.731925 0.005645 1.338861 2 MEASURE1 MEASURE4 2.375 1.792220 1.000000 -4.141168 3 MEASURE2 MEASURE3 0.125 1.201747 1.000000 -4.244322 4 MEASURE2 MEASURE4 -1.500 1.336306 1.000000 -6.358552 5 MEASURE3 MEASURE4 -1.625 1.821866 1.000000 -8.248955 CI_UPPER 0 6.825345 1 6.661139 2 8.891168 3 4.494322 4 3.358552 5 4.998955
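The Group, Subject, and Error sums of squares in the ANOVA table follow the standard repeated-measures decomposition, which can be verified by hand with NumPy. The matrix below is the 8x4 measurement table from the example; the snippet only cross-checks the arithmetic and is not part of hana_ml:
>>> import numpy as np
>>> scores = np.array([[8, 7, 1, 6], [9, 5, 2, 5], [6, 2, 3, 8], [5, 3, 1, 9],
...                    [8, 4, 5, 8], [7, 5, 6, 7], [10, 2, 7, 2], [12, 6, 8, 1]],
...                   dtype=float)      # rows = subjects, columns = measures
>>> n_subjects, n_measures = scores.shape
>>> grand_mean = scores.mean()
>>> ss_group = n_subjects * ((scores.mean(axis=0) - grand_mean) ** 2).sum()    # ~83.125
>>> ss_subject = n_measures * ((scores.mean(axis=1) - grand_mean) ** 2).sum()  # ~17.375
>>> ss_error = ((scores - grand_mean) ** 2).sum() - ss_group - ss_subject      # ~153.375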
-
hana_ml.algorithms.pal.stats.
pearsonr_matrix
(conn_context, data, cols=None)¶ Computes a correlation matrix using Pearson’s correlation coefficient.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data : DataFrame
Input data.
- cols : list of str, optional
List of column names to analyze. If ‘cols’ is not provided, it defaults to all columns.
Returns: - pearsonr_matrix : DataFrame
Pearson’s correlation coefficient between any two data samples (columns).
- ID, type NVARCHAR. The values of this column are the column names from cols.
- Correlation coefficient columns, type DOUBLE, named after the columns in cols. The correlation coefficient between variables X and Y is in column X, in the row with ID value Y.
Examples
Dataset to be analyzed:
>>> df.collect()
    X     Y
0   1   2.4
1   5   3.5
2   3   8.9
3  10  -1.4
4  -4  -3.5
5  11  32.8
Compute the Pearson’s correlation coefficient matrix:
>>> result = pearsonr_matrix(conn, df)
Outputs:
>>> result.collect()
  ID               X               Y
0  X               1  0.592707653621
1  Y  0.592707653621               1
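As an illustrative local cross-check (not part of hana_ml), numpy.corrcoef reproduces the same coefficient from the two columns above:
>>> import numpy as np
>>> x = [1, 5, 3, 10, -4, 11]
>>> y = [2.4, 3.5, 8.9, -1.4, -3.5, 32.8]
>>> np.corrcoef(x, y)   # off-diagonal entries approximately 0.5927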
-
hana_ml.algorithms.pal.stats.
ttest_1samp
(conn_context, data, col=None, mu=0, test_type='two_sides', conf_level=0.95)¶ Perform the t-test to determine whether a sample of observations could have been generated by a process with a specific mean.
Parameters: - conn_context : ConnectionContext
Database connection object.
- data : DataFrame
DataFrame containing the data.
- col : str, optional
Name of the column for sample. If not given, the input dataframe must only have one column.
- mu : float, optional
Hypothesized mean of the population underlying the sample. Default value: 0
- test_type : string, optional
- The alternative hypothesis type.
- ‘two_sides’
- ‘less’
- ‘greater’
Default value: two_sides
- conf_level : float, optional
Confidence level for alternative hypothesis confidence interval. Default value: 0.95
Returns: - stat_df : DataFrame
DataFrame containing the statistics results from the t-test.
Examples
Original data:
>>> df.collect()
    X1
0  1.0
1  2.0
2  4.0
3  7.0
4  3.0
Perform the one-sample t-test:
>>> ttest_1samp(conn, df).collect()
           STAT_NAME  STAT_VALUE
0            t-value    3.302372
1  degree of freedom    4.000000
2            p-value    0.029867
3      _PAL_MEAN_X1_    3.400000
4   confidence level    0.950000
5         lowerLimit    0.541475
6         upperLimit    6.258525
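An equivalent two-sided one-sample t-test can be run locally with SciPy as an illustrative cross-check (not part of hana_ml); the sample values are copied from df above and the hypothesized mean is 0, as in the call:
>>> from scipy.stats import ttest_1samp as scipy_ttest_1samp
>>> t_value, pval = scipy_ttest_1samp([1.0, 2.0, 4.0, 7.0, 3.0], popmean=0)
>>> t_value, pval   # approximately (3.302372, 0.029867)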
-
hana_ml.algorithms.pal.stats.
ttest_ind
(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', var_equal=False, conf_level=0.95)¶ Perform the T-test for the mean difference of two independent samples.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data : DataFrame
DataFrame containing the data.
- col1 : str, optional
Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of the columns will be col1.
- col2 : str, optional
Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the columns will be col2.
- mu : float, optional
Hypothesized difference between the two underlying population means. Default value: 0
- test_type : string, optional
- The alternative hypothesis type.
- ‘two_sides’
- ‘less’
- ‘greater’
Default value: two_sides
- var_equal : bool, optional
Controls whether to assume that the two samples have equal variance. Default value: False
- conf_level : float, optional
Confidence level for alternative hypothesis confidence interval. Default value: 0.95
Returns: - stat_df : DataFrame
DataFrame containing the statistics results from the t-test.
Examples
Original data:
>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  NaN  10.0
Perform the independent-samples t-test:
>>> ttest_ind(conn, df).collect()
           STAT_NAME  STAT_VALUE
0            t-value   -5.013774
1  degree of freedom    5.649757
2            p-value    0.002875
3      _PAL_MEAN_X1_    3.500000
4      _PAL_MEAN_X2_   11.600000
5   confidence level    0.950000
6         lowerLimit  -12.113278
7         upperLimit   -4.086722
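Since var_equal defaults to False, this corresponds to Welch's t-test with the NaN row excluded from X1 (note _PAL_MEAN_X1_ is 3.5). The same statistic can be reproduced locally with SciPy as an illustrative cross-check (not part of hana_ml):
>>> from scipy.stats import ttest_ind as scipy_ttest_ind
>>> x1 = [1.0, 2.0, 4.0, 7.0]            # NaN row dropped
>>> x2 = [10.0, 12.0, 11.0, 15.0, 10.0]
>>> t_value, pval = scipy_ttest_ind(x1, x2, equal_var=False)
>>> t_value, pval   # approximately (-5.013774, 0.002875)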
-
hana_ml.algorithms.pal.stats.
ttest_paired
(conn_context, data, col1=None, col2=None, mu=0, test_type='two_sides', conf_level=0.95)¶ Perform the t-test for the mean difference of two sets of paired samples.
Parameters: - conn_context : ConnectionContext
Database connection object.
- data : DataFrame
DataFrame containing the data.
- col1 : str, optional
Name of the column for sample1. If not given, the input dataframe must only have two columns. The first of two columns will be col1.
- col2 : str, optional
Name of the column for sample2. If not given, the input dataframe must only have two columns. The second of the two columns will be col2.
- mu : float, optional
Hypothesized difference between two underlying population means. Default value: 0
- test_type : string, optional
- The alternative hypothesis type.
- ‘two_sides’
- ‘less’
- ‘greater’
Default value: two_sides
- conf_level : float, optional
Confidence level for alternative hypothesis confidence interval. Default value: 0.95
Returns: - stat_df : DataFrame
- DataFrame containing the statistics results from the t-test.
Examples
Original data:
>>> df.collect()
    X1    X2
0  1.0  10.0
1  2.0  12.0
2  4.0  11.0
3  7.0  15.0
4  3.0  10.0
Perform the paired-samples t-test:
>>> ttest_paired(conn, df).collect()
                STAT_NAME  STAT_VALUE
0                 t-value  -14.062884
1       degree of freedom    4.000000
2                 p-value    0.000148
3  _PAL_MEAN_DIFFERENCES_   -8.200000
4        confidence level    0.950000
5              lowerLimit   -9.818932
6              upperLimit   -6.581068
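As an illustrative cross-check (not part of hana_ml), SciPy's paired t-test yields the same statistic for the two columns above:
>>> from scipy.stats import ttest_rel
>>> x1 = [1.0, 2.0, 4.0, 7.0, 3.0]
>>> x2 = [10.0, 12.0, 11.0, 15.0, 10.0]
>>> t_value, pval = ttest_rel(x1, x2)
>>> t_value, pval   # approximately (-14.062884, 0.000148)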
-
hana_ml.algorithms.pal.stats.
univariate_analysis
(conn_context, data, key=None, cols=None, categorical_variable=None, significance_level=None, trimmed_percentage=None)¶ Provides an overview of the dataset. For continuous columns, it provides the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis. For discrete columns, it provides the number of occurrences and the percentage of the total data in each category.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- data : DataFrame
Input data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- cols : list of str, optional
List of column names to analyze. If cols is not provided, it defaults to all non-ID columns.
- categorical_variable : list of str, optional
INTEGER columns specified in this list will be treated as categorical data. By default, INTEGER columns are treated as continuous.
- significance_level : float, optional
The significance level when the function calculates the confidence interval of the sample mean. Values must be greater than 0 and less than 1. Defaults to 0.05.
- trimmed_percentage : float, optional
The ratio of data at both head and tail that will be dropped in the process of calculating the trimmed mean. Value range is from 0 to 0.5. Defaults to 0.05.
Returns: - continuous_result : DataFrame
- Statistics for continuous variables, structured as follows:
- VARIABLE_NAME, type NVARCHAR(256), variable names.
- STAT_NAME, type NVARCHAR(100), names of statistical quantities, including the count of valid observations, min, lower quartile, median, upper quartile, max, mean, confidence interval for the mean (lower and upper bound), trimmed mean, variance, standard deviation, skewness, and kurtosis (14 quantities in total).
- STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
- categorical_result : DataFrame
- Statistics for categorical variables, structured as follows:
- VARIABLE_NAME, type NVARCHAR(256), variable names.
- CATEGORY, type NVARCHAR(256), category names of the corresponding variables. Null is also treated as a category.
- STAT_NAME, type NVARCHAR(100), names of statistical quantities: number of observations, percentage of total data points falling in the current category for a variable (including null).
- STAT_VALUE, type DOUBLE, values for the corresponding statistical quantities.
Examples
Dataset to be analyzed:
>>> df.collect()
      X1    X2  X3 X4
0    1.2  None   1  A
1    2.5  None   2  C
2    5.2  None   3  A
3  -10.2  None   2  A
4    8.5  None   2  C
5  100.0  None   3  B
Perform univariate analysis:
>>> continuous, categorical = univariate_analysis(
...     conn,
...     df,
...     categorical_variable=['X3'],
...     significance_level=0.05,
...     trimmed_percentage=0.2)
Outputs:
>>> continuous.collect() VARIABLE_NAME STAT_NAME STAT_VALUE 0 X1 valid observations 6.000000 1 X1 min -10.200000 2 X1 lower quartile 1.200000 3 X1 median 3.850000 4 X1 upper quartile 8.500000 5 X1 max 100.000000 6 X1 mean 17.866667 7 X1 CI for mean, lower bound -24.879549 8 X1 CI for mean, upper bound 60.612883 9 X1 trimmed mean 4.350000 10 X1 variance 1659.142667 11 X1 standard deviation 40.732575 12 X1 skewness 1.688495 13 X1 kurtosis 1.036148 14 X2 valid observations 0.000000
>>> categorical.collect() VARIABLE_NAME CATEGORY STAT_NAME STAT_VALUE 0 X3 __PAL_NULL__ count 0.000000 1 X3 __PAL_NULL__ percentage(%) 0.000000 2 X3 1 count 1.000000 3 X3 1 percentage(%) 16.666667 4 X3 2 count 3.000000 5 X3 2 percentage(%) 50.000000 6 X3 3 count 2.000000 7 X3 3 percentage(%) 33.333333 8 X4 __PAL_NULL__ count 0.000000 9 X4 __PAL_NULL__ percentage(%) 0.000000 10 X4 A count 3.000000 11 X4 A percentage(%) 50.000000 12 X4 B count 1.000000 13 X4 B percentage(%) 16.666667 14 X4 C count 2.000000 15 X4 C percentage(%) 33.333333
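Individual statistics can be cross-checked locally. For instance, with trimmed_percentage=0.2 and six observations, the trimmed mean of X1 drops the single smallest and largest values, which scipy.stats.trim_mean reproduces (an illustrative check only, not part of hana_ml):
>>> from scipy.stats import trim_mean
>>> x1 = [1.2, 2.5, 5.2, -10.2, 8.5, 100.0]
>>> trim_mean(x1, proportiontocut=0.2)   # approximately 4.35, matching 'trimmed mean' above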
hana_ml.algorithms.pal.svm¶
This module contains PAL wrapper and helper functions for Support Vector Machine algorithms.
The following classes are available:
-
class
hana_ml.algorithms.pal.svm.
OneClassSVM
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, nu=None, scale_info=None, categorical_variable=None, category_weight=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
One Class SVM
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- c : float, optional
Trade-off between training error and margin. Value range: > 0. Defaults to 100.0.
- kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to rbf.
- degree : int, optional
Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.
- gamma : float, optional
Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.
- coef_lin : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- coef_const : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- shrink : bool, optional
If true, use shrink strategy. Defaults to True.
- tol : float, optional
Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.
- evaluation_seed : int, optional
The random seed in parameter selection. Value range: >= 0. Defaults to 0.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.
- nu : float, optional
The value for both the upper bound of the fraction of training errors and the lower bound of the fraction of support vectors. Defaults to 0.5.
- scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
- ‘no’ : No scale.
- ‘standardization’ : Transforms the data to have zero mean and unit variance.
- ‘rescale’ : Rescales the features to the range [-1, 1].
Defaults to standardization.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- category_weight : float, optional
Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.
Examples
Training data:
>>> df_fit.head(10).collect() ID ATTRIBUTE1 ATTRIBUTE2 ATTRIBUTE3 ATTRIBUTE4 0 0 1.0 10.0 100.0 A 1 1 1.1 10.1 100.0 A 2 2 1.2 10.2 100.0 A 3 3 1.3 10.4 100.0 A 4 4 1.2 10.3 100.0 AB 5 5 4.0 40.0 400.0 AB 6 6 4.1 40.1 400.0 AB 7 7 4.2 40.2 400.0 AB 8 8 4.3 40.4 400.0 AB 9 9 4.2 40.3 400.0 AB
Create OneClassSVM instance and call fit:
>>> svc_one = svm.OneClassSVM(conn, scale_info='no', category_weight=1) >>> svc_one.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', ... 'ATTRIBUTE4']) >>> df_predict = conn.table("DATA_TBL_SVC_ONE_PREDICT") >>> df_predict.head(10).collect() ID ATTRIBUTE1 ATTRIBUTE2 ATTRIBUTE3 ATTRIBUTE4 0 0 1.0 10.0 100.0 A 1 1 1.1 10.1 100.0 A 2 2 1.2 10.2 100.0 A 3 3 1.3 10.4 100.0 A 4 4 1.2 10.3 100.0 AB 5 5 4.0 40.0 400.0 AB 6 6 4.1 40.1 400.0 AB 7 7 4.2 40.2 400.0 AB 8 8 4.3 40.4 400.0 AB 9 9 4.2 40.3 400.0 AB >>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', ... 'ATTRIBUTE4']
Call predict:
>>> svc_one.predict(df_predict, 'ID', features).head(10).collect() ID SCORE PROBABILITY 0 0 -1 None 1 1 1 None 2 2 1 None 3 3 -1 None 4 4 -1 None 5 5 -1 None 6 6 -1 None 7 7 1 None 8 8 -1 None 9 9 -1 None
Attributes: - model_ : DataFrame
Model content.
Methods
fit
(data[, key, features])Fit the model when given training dataset and other attributes. predict
(data, key[, features])Predict the dataset using the trained model. -
fit
(data, key=None, features=None)¶ Fit the model when given training dataset and other attributes.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
-
predict
(data, key, features=None)¶ Predict the dataset using the trained model.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Predict result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- SCORE, type NVARCHAR(100), prediction value.
- PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.
-
class
hana_ml.algorithms.pal.svm.
SVC
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, categorical_variable=None, category_weight=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
Support Vector Classification
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- c : float, optional
Trade-off between training error and margin. Value range: > 0. Defaults to 100.0.
- kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to rbf.
- degree : int, optional
Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.
- gamma : float, optional
Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.
- coef_lin : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- coef_const : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- probability : bool, optional
If true, output probability during prediction. Defaults to False.
- shrink : bool, optional
If true, use shrink strategy. Defaults to True.
- tol : float, optional
Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.
- evaluation_seed : int, optional
The random seed in parameter selection. Value range: >= 0. Defaults to 0.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.
- scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
- ‘no’ : No scale.
- ‘standardization’ : Transforms the data to have zero mean and unit variance.
- ‘rescale’ : Rescales the features to the range [-1, 1].
Defaults to standardization.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- category_weight : float, optional
Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.
Examples
Training data:
>>> df_fit.head(10).collect() ID ATTRIBUTE1 ATTRIBUTE2 ATTRIBUTE3 ATTRIBUTE4 LABEL 0 0 1.0 10.0 100.0 A 1 1 1 1.1 10.1 100.0 A 1 2 2 1.2 10.2 100.0 A 1 3 3 1.3 10.4 100.0 A 1 4 4 1.2 10.3 100.0 AB 1 5 5 4.0 40.0 400.0 AB 2 6 6 4.1 40.1 400.0 AB 2 7 7 4.2 40.2 400.0 AB 2 8 8 4.3 40.4 400.0 AB 2 9 9 4.2 40.3 400.0 AB 2
Create SVC instance and call fit:
>>> svc = svm.SVC(connection_context, gamma=0.005) >>> svc.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', ... 'ATTRIBUTE3', 'ATTRIBUTE4']) >>> df_predict = connection_context.table("SVC_PREDICT_DATA_TBL") >>> df_predict.collect() ID ATTRIBUTE1 ATTRIBUTE2 ATTRIBUTE3 ATTRIBUTE4 0 0 1.0 10.0 100.0 A 1 1 1.2 10.2 100.0 A 2 2 4.1 40.1 400.0 AB 3 3 4.2 40.3 400.0 AB 4 4 9.1 90.1 900.0 A 5 5 9.2 90.2 900.0 A 6 6 4.0 40.0 400.0 A
Call predict:
>>> res = svc.predict(df_predict, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', ... 'ATTRIBUTE3', 'ATTRIBUTE4']) >>> res.collect() ID SCORE PROBABILITY 0 0 1 None 1 1 1 None 2 2 2 None 3 3 2 None 4 4 3 None 5 5 3 None 6 6 2 None
Attributes: - model_ : DataFrame
Model content.
Methods
fit
(data[, key, features, label])Fit the model when given training dataset and other attributes. predict
(data, key[, features, verbose])Predict the dataset using the trained model. score
(data, key[, features, label])Returns the accuracy on the given test data and labels. -
fit
(data, key=None, features=None, label=None)¶ Fit the model when given training dataset and other attributes.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
-
predict
(data, key, features=None, verbose=False)¶ Predict the dataset using the trained model.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
- verbose : bool, optional
If true, output scoring probabilities for each class. It is only applicable when probability is true during instance creation. Defaults to False.
Returns: - DataFrame
- Predict result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- SCORE, type NVARCHAR(100), prediction value.
- PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.
-
score
(data, key, features=None, label=None)¶ Returns the accuracy on the given test data and labels.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
Returns: - accuracy : float
Scalar accuracy value comparing the predicted result and original label.
-
class
hana_ml.algorithms.pal.svm.
SVR
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, scale_label=None, categorical_variable=None, category_weight=None, regression_eps=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
Support Vector Regression
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- c : float, optional
Trade-off between training error and margin. Value range: > 0. Defaults to 100.0.
- kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to rbf.
- degree : int, optional
Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.
- gamma : float, optional
Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.
- coef_lin : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- coef_const : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- shrink : bool, optional
If true, use shrink strategy. Defaults to True.
- tol : float, optional
Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.
- evaluation_seed : int, optional
The random seed in parameter selection. Value range: >= 0. Defaults to 0.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.
- scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
- ‘no’ : No scale.
- ‘standardization’ : Transforms the data to have zero mean and unit variance.
- ‘rescale’ : Rescales the features to the range [-1, 1].
Defaults to standardization.
- scale_label : bool, optional
If true, standardize the label for SVR. It is only applicable when the scale_info is standardization. Defaults to True.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- category_weight : float, optional
Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.
- regression_eps : float, optional
Epsilon width of tube for regression. Defaults to 0.1.
Examples
Training data:
>>> df_fit.collect() ID ATTRIBUTE1 ATTRIBUTE2 ATTRIBUTE3 ATTRIBUTE4 ATTRIBUTE5 VALUE 0 0 0.788606 0.787308 -1.301485 1.226053 -0.533385 95.626483 1 1 0.414869 -0.381038 -0.719309 1.603499 1.557837 162.582000 2 2 0.236282 -1.118764 0.233341 -0.698410 0.387380 -56.564303 3 3 -0.087779 -0.462372 -0.038412 -0.552897 1.231209 -32.241614 4 4 -0.476389 1.836772 -0.292337 -1.364599 1.326768 -143.240878 5 5 0.523326 0.065154 -1.513822 0.498921 -0.590686 -5.237827 6 6 -1.425838 -0.900437 -0.672299 0.646424 0.508856 -43.005837 7 7 -1.601836 0.455530 0.438217 -0.860707 -0.338282 -126.389824 8 8 0.266698 -0.725057 0.462189 0.868752 -1.542683 46.633594 9 9 -0.772496 -2.192955 0.822904 -1.125882 -0.946846 -175.356260 10 10 0.492364 -0.654237 -0.226986 -0.387156 -0.585063 -49.213910 11 11 0.378409 -1.544976 0.622448 -0.098902 1.437910 34.788276 12 12 0.317183 0.473067 -1.027916 0.549077 0.013483 32.845141 13 13 1.340660 -1.082651 0.730509 -0.944931 0.351025 -6.500411 14 14 0.736456 1.649251 1.334451 -0.530776 0.280830 87.451863
Create SVR instance and call fit:
>>> svr = svm.SVR(conn, kernel='linear', scale_info='standardization', ... scale_label=True) >>> svr.fit(df_fit, 'ID', ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', ... 'ATTRIBUTE4', 'ATTRIBUTE5'])
Attributes: - model_ : DataFrame
Model content.
Methods
fit
(data, key[, features, label])Fit the model when given training dataset and other attributes. predict
(data, key[, features])Predict the dataset using the trained model. score
(data, key[, features, label])Returns the coefficient of determination R^2 of the prediction. -
fit
(data, key, features=None, label=None)¶ Fit the model when given training dataset and other attributes.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
-
predict
(data, key, features=None)¶ Predict the dataset using the trained model.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID columns.
Returns: - DataFrame
- Predict result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- SCORE, type NVARCHAR(100), prediction value.
- PROBABILITY, type DOUBLE, prediction probability. Always NULL. This column is only used for SVC and SVRanking.
-
score
(data, key, features=None, label=None)¶ Returns the coefficient of determination R^2 of the prediction.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID and non-label columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
Returns: - accuracy : float
Returns the coefficient of determination R^2 of the prediction.
-
class
hana_ml.algorithms.pal.svm.
SVRanking
(conn_context, c=None, kernel='rbf', degree=None, gamma=None, coef_lin=None, coef_const=None, probability=False, shrink=True, tol=None, evaluation_seed=None, thread_ratio=None, scale_info=None, categorical_variable=None, category_weight=None)¶ Bases:
hana_ml.algorithms.pal.svm._SVMBase
Support Vector Ranking
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- c : float, optional
Trade-off between training error and margin. Value range: > 0. Defaults to 100.
- kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’}, optional
Defaults to rbf.
- degree : int, optional
Coefficient for the poly kernel type. Value range: >= 1. Defaults to 3.
- gamma : float, optional
Coefficient for the rbf kernel type. Defaults to 1.0/number of features in the dataset. Only valid when kernel is rbf.
- coef_lin : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- coef_const : float, optional
Coefficient for the poly/sigmoid kernel type. Defaults to 0.
- probability : bool, optional
If true, output probability during prediction. Defaults to False.
- shrink : bool, optional
If true, use shrink strategy. Defaults to True.
- tol : float, optional
Specifies the error tolerance in the training process. Value range: > 0. Defaults to 0.001.
- evaluation_seed : int, optional
The random seed in parameter selection. Value range: >= 0. Defaults to 0.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.0.
- scale_info : {‘no’, ‘standardization’, ‘rescale’}, optional
Options:
- ‘no’ : No scale.
- ‘standardization’ : Transforms the data to have zero mean and unit variance.
- ‘rescale’ : Rescales the features to the range [-1, 1].
Defaults to standardization.
- categorical_variable : list of str, optional
Column names in the data table used as category variable.
- category_weight : float, optional
Represents the weight of category attributes. Value range: > 0. Defaults to 0.707.
Notes
PAL will throw an error if probability=True is provided to the SVRanking constructor and verbose=True is not provided to predict(). This is a known bug.
Examples
Training data:
>>> df_fit.head(10).collect() ID ATTRIBUTE1 ATTRIBUTE2 ATTRIBUTE3 ATTRIBUTE4 ATTRIBUTE5 QID LABEL 0 0 1.0 1.0 0.0 0.2 0.0 qid:1 3 1 1 0.0 0.0 1.0 0.1 1.0 qid:1 2 2 2 0.0 0.0 1.0 0.3 0.0 qid:1 1 3 3 2.0 1.0 1.0 0.2 0.0 qid:1 4 4 4 3.0 1.0 1.0 0.4 1.0 qid:1 5 5 5 4.0 1.0 1.0 0.7 0.0 qid:1 6 6 6 0.0 0.0 1.0 0.2 0.0 qid:2 1 7 7 1.0 0.0 1.0 0.4 0.0 qid:2 2 8 8 0.0 0.0 1.0 0.2 0.0 qid:2 1 9 9 1.0 1.0 1.0 0.2 0.0 qid:2 3
Create SVRanking instance and call fit:
>>> svranking = svm.SVRanking(conn, gamma=0.005) >>> features = ['ATTRIBUTE1', 'ATTRIBUTE2', 'ATTRIBUTE3', 'ATTRIBUTE4', ... 'ATTRIBUTE5'] >>> svranking.fit(df_fit, 'ID', 'QID', features, 'LABEL')
Call predict:
>>> df_predict = conn.table("DATA_TBL_SVRANKING_PREDICT") >>> df_predict.head(10).collect() ID ATTRIBUTE1 ATTRIBUTE2 ATTRIBUTE3 ATTRIBUTE4 ATTRIBUTE5 QID 0 0 1.0 1.0 0.0 0.2 0.0 qid:1 1 1 0.0 0.0 1.0 0.1 1.0 qid:1 2 2 0.0 0.0 1.0 0.3 0.0 qid:1 3 3 2.0 1.0 1.0 0.2 0.0 qid:1 4 4 3.0 1.0 1.0 0.4 1.0 qid:1 5 5 4.0 1.0 1.0 0.7 0.0 qid:1 6 6 0.0 0.0 1.0 0.2 0.0 qid:4 7 7 1.0 0.0 1.0 0.4 0.0 qid:4 8 8 0.0 0.0 1.0 0.2 0.0 qid:4 9 9 1.0 1.0 1.0 0.2 0.0 qid:4 >>> svranking.predict(df_predict, key='ID', ... features=features, qid='QID').head(10).collect() ID SCORE PROBABILITY 0 0 -9.85138 None 1 1 -10.8657 None 2 2 -11.6741 None 3 3 -9.33985 None 4 4 -7.88839 None 5 5 -6.8842 None 6 6 -11.7081 None 7 7 -10.8003 None 8 8 -11.7081 None 9 9 -10.2583 None
Attributes: - model_ : DataFrame
Model content.
Methods
fit
(data, key, qid[, features, label])Fit the model when given training dataset and other attributes. predict
(data, key, qid[, features, verbose])Predict the dataset using the trained model. -
fit
(data, key, qid, features=None, label=None)¶ Fit the model when given training dataset and other attributes.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- qid : str
Name of the qid column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-label, non-qid columns.
- label : str, optional
Name of the label column. If label is not provided, it defaults to the last column.
-
predict
(data, key, qid, features=None, verbose=False)¶ Predict the dataset using the trained model.
Parameters: - data : DataFrame
DataFrame containing the data.
- key : str
Name of the ID column.
- qid : str
Name of the qid column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all the non-ID, non-qid columns.
- verbose : bool, optional
If true, output scoring probabilities for each class. Defaults to False.
Returns: - DataFrame
- Predict result, structured as follows:
- ID column, with the same name and type as data’s ID column.
- SCORE, type NVARCHAR(100), prediction value.
- PROBABILITY, type DOUBLE, prediction probability. It is NULL when probability is False during instance creation.
hana_ml.algorithms.pal.trees¶
This module includes decision tree-based models for classification and regression.
The following classes are available:
-
class
hana_ml.algorithms.pal.trees.
DecisionTreeClassifier
(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, discretization_type=None, bins=None, max_branch=None, merge_threshold=None, use_surrogate=None, model_format=None, output_rules=True, priors=None, output_confusion_matrix=True)¶ Bases:
hana_ml.algorithms.pal.trees._DecisionTreeBase
Decision Tree model for classification.
Parameters: - conn_context : ConnectionContext
Database connection object.
- algorithm : {‘c45’, ‘chaid’, ‘cart’}
- Algorithm used to grow a decision tree. Case-insensitive.
- ‘c45’: C4.5 algorithm.
- ‘chaid’: Chi-square automatic interaction detection.
- ‘cart’: Classification and regression tree.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
- allow_missing_dependent : bool, optional
- Specifies if a missing target value is allowed.
- False: Not allowed. An error occurs if a missing target is present.
- True: Allowed. The datum with the missing target is removed.
Defaults to True.
- percentage : float, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning. Defaults to 1.0.
- min_records_of_parent : int, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting. Defaults to 2.
- min_records_of_leaf : int, optional
Guarantees a minimum number of records in each leaf. Defaults to 1.
- max_depth : int, optional
The maximum depth of a tree. By default it is unlimited.
- categorical_variable : list of str, optional
Specifies which features should be treated as categorical. Only valid for integer columns; entries for other columns are ignored. By default, string columns are treated as categorical and integer or float columns as continuous.
- split_threshold : float, optional
- Specifies the stop condition for a node:
- C45: The information gain ratio of the best split is less than this value.
- CHAID: The p-value of the best split is greater than or equal to this value.
- CART: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the SPLIT_THRESHOLD value, the larger a C45 or CART tree grows; conversely, CHAID grows a larger tree with a larger SPLIT_THRESHOLD value. Defaults to 1e-5 for C45 and CART, 0.05 for CHAID.
- discretization_type : {‘mdlpc’, ‘equal_freq’}, optional
- Strategy for discretizing continuous attributes. Case-insensitive.
- ‘mdlpc’: Minimum description length principle criterion.
- ‘equal_freq’: Equal frequency discretization.
Valid only for C45 and CHAID. Defaults to mdlpc.
- bins : List of tuples: (column name, number of bins), optional
Specifies the number of bins for discretization. Only valid when discretization_type is equal_freq. Defaults to 10 for each column.
- max_branch : int, optional
Specifies the maximum number of branches. Valid only for CHAID. Defaults to 10.
- merge_threshold : float, optional
Specifies the merge condition for CHAID: if the metric value is greater than or equal to the specified value, the algorithm will merge two branches. Only valid for CHAID. Defaults to 0.05.
- use_surrogate : bool, optional
If true, use surrogate split when NULL values are encountered. Only valid for CART. Defaults to True.
- model_format : {‘json’, ‘pmml’}, optional
- Specifies the tree model format for store. Case-insensitive.
- ‘json’: export model in json format.
- ‘pmml’: export model in pmml format.
Defaults to json.
- output_rules : bool, optional
If true, output decision rules. Defaults to True.
- priors : List of tuples: (class, prior_prob), optional
Specifies the prior probability of every class label. Default value detected from data.
- output_confusion_matrix : bool, optional
If true, output the confusion matrix. Defaults to True.
Examples
Input dataframe for training:
>>> df1.head(4).collect() OUTLOOK TEMP HUMIDITY WINDY CLASS 0 Sunny 75 70.0 Yes Play 1 Sunny 80 90.0 Yes Do not Play 2 Sunny 85 85.0 No Do not Play 3 Sunny 72 95.0 No Do not Play
Creating DecisionTreeClassifier instance:
>>> dtc = DecisionTreeClassifier(conn_context=cc, algorithm='c45', ... min_records_of_parent=2, ... min_records_of_leaf=1, ... thread_ratio=0.4, split_threshold=1e-5, ... model_format='json', output_rules=True)
Performing fit() on given dataframe:
>>> dtc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'], ... label='CLASS') >>> dtc.decision_rules_.collect() ROW_INDEX RULES_CONTENT 0 0 (TEMP>=84) => Do not Play 1 1 (TEMP<84) && (OUTLOOK=Overcast) => Play 2 2 (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY<82.5) => Play 3 3 (TEMP<84) && (OUTLOOK=Sunny) && (HUMIDITY>=82.5) => Do not Play 4 4 (TEMP<84) && (OUTLOOK=Rain) && (WINDY=Yes) => Do not Play 5 5 (TEMP<84) && (OUTLOOK=Rain) && (WINDY=No) => Play
Input dataframe for predicting:
>>> df2.collect() ID OUTLOOK HUMIDITY TEMP WINDY 0 0 Overcast 75.0 70 Yes 1 1 Rain 78.0 70 Yes 2 2 Sunny 66.0 70 Yes 3 3 Sunny 69.0 70 Yes 4 4 Rain NaN 70 Yes 5 5 None 70.0 70 Yes 6 6 *** 70.0 70 Yes
Performing predict() on given dataframe:
>>> result = dtc.predict(df2, key='ID', verbose=False) >>> result.collect() ID SCORE CONFIDENCE 0 0 Play 1.000000 1 1 Do not Play 1.000000 2 2 Play 1.000000 3 3 Play 1.000000 4 4 Do not Play 1.000000 5 5 Play 0.692308 6 6 Play 0.692308
Input dataframe for scoring:
>>> df3.collect() ID OUTLOOK HUMIDITY TEMP WINDY LABEL 0 0 Overcast 75.0 70 Yes Play 1 1 Rain 78.0 70 No Do not Play 2 2 Sunny 66.0 70 Yes Play 3 3 Sunny 69.0 70 Yes Play
Performing score() on given dataframe:
>>> dtc.score(df3, key='ID') 0.75
Attributes: - model_ : DataFrame
Trained model content.
- decision_rules_ : DataFrame
Rules for decision tree to make decisions. Set to None if output_rules is False.
- confusion_matrix_ : DataFrame
Confusion matrix used to evaluate the performance of classification algorithms. Set to None if output_confusion_matrix is False.
Methods
fit
(data[, key, features, label])Function for building a decision tree classifier. predict
(data, key[, features, verbose])Predict dependent variable values based on fitted model. score
(data, key[, features, label])Returns the mean accuracy on the given test data and labels. -
fit
(data, key=None, features=None, label=None)¶ Function for building a decision tree classifier.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
-
predict
(data, key, features=None, verbose=False)¶ Predict dependent variable values based on fitted model.
Parameters: - data : DataFrame
Independent variable values to predict for.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- verbose : bool, optional
If true, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification. Defaults to False.
Returns: - DataFrame
- DataFrame of score and confidence, structured as follows:
- ID column, with same name and type as data’s ID column.
- SCORE, type DOUBLE, representing the predicted classes/values.
- CONFIDENCE, type DOUBLE, representing the confidence of a class; all 0s for regression.
-
score
(data, key, features=None, label=None)¶ Returns the mean accuracy on the given test data and labels.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
Returns: - float
Mean accuracy on the given test data and labels.
-
class
hana_ml.algorithms.pal.trees.
DecisionTreeRegressor
(conn_context, algorithm, thread_ratio=None, allow_missing_dependent=True, percentage=None, min_records_of_parent=None, min_records_of_leaf=None, max_depth=None, categorical_variable=None, split_threshold=None, use_surrogate=None, model_format=None, output_rules=True)¶ Bases:
hana_ml.algorithms.pal.trees._DecisionTreeBase
Decision Tree model for regression.
Parameters: - conn_context : ConnectionContext
Database connection object.
- algorithm : {‘cart’}
- Algorithm used to grow a decision tree.
- ‘cart’: Classification and Regression tree.
Currently supports cart.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to 0.
- allow_missing_dependent : bool, optional
- Specifies if a missing target value is allowed.
- False: Not allowed. An error occurs if a missing target is present.
- True: Allowed. The datum with the missing target is removed.
Defaults to True.
- percentage : float, optional
Specifies the percentage of the input data that will be used to build the tree model. The rest of the data will be used for pruning. Defaults to 1.0.
- min_records_of_parent : int, optional
Specifies the stop condition: if the number of records in one node is less than the specified value, the algorithm stops splitting. Defaults to 2.
- min_records_of_leaf : int, optional
Specifies the minimum number of records in a leaf. Defaults to 1.
- max_depth : int, optional
The maximum depth of a tree. By default it is unlimited.
- categorical_variable : list of str, optional
Indicates which features should be treated as categorical. The default behavior is: string columns are categorical, while integer and float columns are continuous. This parameter is valid only for integer variables and is omitted otherwise. The default is detected from the input data. A usage sketch follows this parameter list.
- split_threshold : float, optional
- Specifies the stop condition for a node:
- CART: The reduction of Gini index or relative MSE of the best split is less than this value.
The smaller the SPLIT_THRESHOLD value is, the larger a CART tree grows. Defaults to 1e-5 for CART.
- use_surrogate : bool, optional
If true, use surrogate split when NULL values are encountered. Only valid for cart. Defaults to True.
- model_format : {‘json’, ‘pmml’}, optional
- Specifies the format in which the tree model is stored. Case-insensitive.
- ‘json’: export model in json format.
- ‘pmml’: export model in pmml format.
Defaults to json.
- output_rules : bool, optional
If true, output decision rules. Defaults to True.
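As referenced in the categorical_variable description above, a minimal sketch of treating an integer feature as categorical; the connection cc follows the examples below, while df_train and the column name 'STORE_ID' are hypothetical:
>>> dtr = DecisionTreeRegressor(conn_context=cc, algorithm='cart',
...                             categorical_variable=['STORE_ID'],  # hypothetical integer column treated as categorical
...                             output_rules=True)
>>> dtr.fit(df_train, key='ID')  # df_train is assumed to contain STORE_ID, other features, and a label column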
Examples
Input dataframe for training:
>>> df1.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Creating DecisionTreeRegressor instance:
>>> dtr = DecisionTreeRegressor(conn_context=cc, algorithm='cart',
...                             min_records_of_parent=2, min_records_of_leaf=1,
...                             thread_ratio=0.4, split_threshold=1e-5,
...                             model_format='pmml', output_rules=True)
Performing fit() on given dataframe:
>>> dtr.fit(df1, key='ID')
>>> dtr.decision_rules_.head(2).collect()
   ROW_INDEX                                 RULES_CONTENT
0          0   (A<-0.495502) && (B<-0.663588) => -85.8762
1          1  (A<-0.495502) && (B>=-0.663588) => -29.9827
Input dataframe for predicting:
>>> df2.collect()
   ID         A         B         C         D
0   0  1.764052  0.400157  0.978738  2.240893
1   1  1.867558 -0.977278  0.950088 -0.151357
2   2 -0.103219  0.410598  0.144044  1.454274
3   3  0.761038  0.121675  0.443863  0.333674
4   4  1.494079 -0.205158  0.313068 -0.854096
Performing predict() on given dataframe:
>>> result = dtr.predict(df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0  49.8229         0.0
1   1  4.87728         0.0
2   2  11.9148         0.0
3   3   19.753         0.0
4   4   23.607         0.0
Input dataframe for scoring:
>>> df3.collect()
   ID         A         B         C         D      CLASS
0   0  1.764052  0.400157  0.978738  2.240893  49.822907
1   1  1.867558 -0.977278  0.950088 -0.151357   4.877286
2   2 -0.103219  0.410598  0.144044  1.454274  11.914875
3   3  0.761038  0.121675  0.443863  0.333674  19.753078
4   4  1.494079 -0.205158  0.313068 -0.854096  23.607000
Performing score() on given dataframe:
>>> dtr.score(df3, key='ID')
0.9999999999900131
Attributes: - model_ : DataFrame
Trained model content.
- decision_rules_ : DataFrame
Rules used by the decision tree to make decisions. Set to None if output_rules is False.
Methods
- fit(data[, key, features, label]): Train the model on input data.
- predict(data, key[, features, verbose]): Predict dependent variable values based on fitted model.
- score(data, key[, features, label]): Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)¶
Train the model on input data.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
predict(data, key, features=None, verbose=False)¶
Predict dependent variable values based on fitted model.
Parameters: - data : DataFrame
Independent variable values to predict for.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- verbose : bool, optional
If true, output all classes and the corresponding confidences for each data point. This parameter is valid only for classification. Defaults to False.
Returns: - DataFrame
- DataFrame of score and confidence, structured as follows:
- ID column, with same name and type as data’s ID column.
- SCORE, type DOUBLE, representing the predicted classes/values.
- CONFIDENCE, type DOUBLE, representing the confidence of the predicted class; all zeros for regression.
score(data, key, features=None, label=None)¶
Returns the coefficient of determination R^2 of the prediction.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
Returns: - float
The coefficient of determination R^2 of the prediction on the given data.
class hana_ml.algorithms.pal.trees.RandomForestClassifier(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=1, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None, strata=None, priors=None)¶
Bases: hana_ml.algorithms.pal.trees._RandomForestBase
Random forest model for classification.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- n_estimators : int, optional
Specifies the number of trees in the random forest. Defaults to 100.
- max_features : int, optional
Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to sqrt(p) (for classification) or p/3(for regression), where p is the number of input features.
- max_depth : int, optional
The maximum depth of a tree. By default it is unlimited.
- min_samples_leaf : int, optional
Specifies the minimum number of records in a leaf. Defaults to 1 for classification.
- split_threshold : float, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing. Defaults to 1e-5.
- calculate_oob : bool, optional
If true, calculate the out-of-bag error. Defaults to True.
- random_state : int, optional
Specifies the seed for random number generator. 0: Uses the current time (in seconds) as the seed. Others: Uses the specified value as the seed. Defaults to 0.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to heuristically determined.
- allow_missing_dependent : bool, optional
- Specifies if a missing target value is allowed.
- False: Not allowed. An error occurs if a missing target is present.
- True: Allowed. The datum with the missing target is removed.
Defaults to True.
- categorical_variable : list of str, optional
Indicates which features should be treated as categorical. The default behavior is: string columns are categorical, while integer and float columns are continuous. This parameter is valid only for integer variables and is omitted otherwise. The default is detected from the input data.
- sample_fraction : float, optional
The fraction of data used for training. If there are n rows of data and the sample fraction is r, then n*r rows are selected for training. Defaults to 1.0.
- strata : List of tuples: (class, fraction), optional
Strata proportions for stratified sampling. A (class, fraction) tuple specifies that rows with that class should make up the specified fraction of each sample. If the given fractions do not add up to 1, the remaining portion is divided equally between classes with no entry in strata, or between all classes if all classes have an entry in strata. If strata is not provided, bagging is used instead of stratified sampling.
- priors : List of tuples: (class, prior_prob), optional
Prior probabilities for classes. A (class, prior_prob) tuple specifies the prior probability of this class. If the given priors do not add up to 1, the remaining portion is divided equally between classes with no entry in priors, or between all classes if all classes have an entry in ‘priors’. If priors is not provided, it is determined by the proportion of every class in the training data.
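A minimal sketch of passing strata and priors as lists of (class, value) tuples; the class labels follow the example below, while the fractions, probabilities, and n_estimators value are illustrative only:
>>> rfc = RandomForestClassifier(conn_context=cc, n_estimators=10,
...                              strata=[('Play', 0.5), ('Do not Play', 0.5)],  # stratified sampling proportions per class
...                              priors=[('Play', 0.7), ('Do not Play', 0.3)])  # prior class probabilities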
Examples
Input dataframe for training:
>>> df1.head(4).collect()
  OUTLOOK  TEMP  HUMIDITY WINDY        LABEL
0   Sunny  75.0      70.0   Yes         Play
1   Sunny   NaN      90.0   Yes  Do not Play
2   Sunny  85.0       NaN    No  Do not Play
3   Sunny  72.0      95.0    No  Do not Play
Creating RandomForestClassifier instance:
>>> rfc = RandomForestClassifier(conn_context=cc, n_estimators=3,
...                              max_features=3, random_state=2,
...                              split_threshold=0.00001,
...                              calculate_oob=True,
...                              min_samples_leaf=1, thread_ratio=1.0)
Performing fit() on given dataframe:
>>> rfc.fit(df1, features=['OUTLOOK', 'TEMP', 'HUMIDITY', 'WINDY'],
...         label='LABEL')
>>> rfc.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0       OUTLOOK    0.449550
1          TEMP    0.216216
2      HUMIDITY    0.208108
3         WINDY    0.126126
Input dataframe for predicting:
>>> df2.collect()
   ID   OUTLOOK  TEMP  HUMIDITY WINDY
0   0  Overcast  75.0  -10000.0   Yes
1   1      Rain  78.0      70.0   Yes
Performing predict() on given dataframe:
>>> result = rfc.predict(df2, key='ID', verbose=False)
>>> result.collect()
   ID SCORE  CONFIDENCE
0   0  Play    0.666667
1   1  Play    0.666667
Input dataframe for scoring:
>>> df3.collect()
   ID   OUTLOOK  TEMP  HUMIDITY WINDY LABEL
0   0     Sunny    70      90.0   Yes  Play
1   1  Overcast    81      90.0   Yes  Play
2   2      Rain    65      80.0    No  Play
Performing score() on given dataframe:
>>> rfc.score(df3, key='ID')
0.6666666666666666
Attributes: - model_ : DataFrame
Trained model content.
- feature_importances_ : DataFrame
The feature importance (the higher, the more important the feature).
- oob_error_ : DataFrame
Out-of-bag error rate or mean squared error of the random forest, reported cumulatively up to the indexed tree. Set to None if calculate_oob is False.
- confusion_matrix_ : DataFrame
Confusion matrix used to evaluate the performance of classification algorithms.
Methods
- fit(data[, key, features, label]): Train the model on input data.
- predict(data, key[, features, verbose, …]): Predict dependent variable values based on fitted model.
- score(data, key[, features, label, …]): Returns the mean accuracy on the given test data and labels.
fit(data, key=None, features=None, label=None)¶
Train the model on input data.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
predict(data, key, features=None, verbose=None, block_size=None, missing_replacement=None)¶
Predict dependent variable values based on fitted model.
Parameters: - data : DataFrame
Independent variable values to predict for.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- block_size : int, optional
The number of rows loaded per batch during prediction. 0 indicates loading all data at once. Defaults to 0.
- missing_replacement : str, optional
- The missing replacement strategy:
- ‘feature_marginalized’: marginalise each missing feature out independently.
- ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to feature_marginalized.
- verbose : bool, optional
If true, output all classes and the corresponding confidences for each data point.
Returns: - DataFrame
- DataFrame of score and confidence, structured as follows:
- ID column, with same name and type as data’s ID column.
- SCORE, type DOUBLE, representing the predicted classes.
- CONFIDENCE, type DOUBLE, representing the confidence of a class.
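A minimal sketch of the optional prediction controls; rfc follows the examples above, while df_new and the block_size value are hypothetical:
>>> result = rfc.predict(df_new, key='ID',
...                      block_size=1000,                              # load 1000 rows per batch
...                      missing_replacement='instance_marginalized')  # marginalise missing features in a row jointly
>>> result.collect()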
score(data, key, features=None, label=None, block_size=None, missing_replacement=None)¶
Returns the mean accuracy on the given test data and labels.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
- block_size : int, optional
The number of rows loaded per batch during prediction. 0 indicates loading all data at once. Defaults to 0.
- missing_replacement : str, optional
- The missing replacement strategy:
- ‘feature_marginalized’: marginalise each missing feature out independently.
- ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to feature_marginalized.
Returns: - float
Mean accuracy on the given test data and labels.
class hana_ml.algorithms.pal.trees.RandomForestRegressor(conn_context, n_estimators=100, max_features=None, max_depth=None, min_samples_leaf=None, split_threshold=None, calculate_oob=True, random_state=None, thread_ratio=None, allow_missing_dependent=True, categorical_variable=None, sample_fraction=None)¶
Bases: hana_ml.algorithms.pal.trees._RandomForestBase
Random forest model for regression.
Parameters: - conn_context : ConnectionContext
Connection to the HANA system.
- n_estimators : int, optional
Specifies the number of trees in the random forest. Defaults to 100.
- max_features : int, optional
Specifies the number of randomly selected splitting variables. Should not be larger than the number of input features. Defaults to sqrt(p) (for classification) or p/3(for regression), where p is the number of input features.
- max_depth : int, optional
The maximum depth of a tree. By default it is unlimited.
- min_samples_leaf : int, optional
Specifies the minimum number of records in a leaf. Defaults to 5 for regression.
- split_threshold : float, optional
Specifies the stop condition: if the improvement value of the best split is less than this value, the tree stops growing. Defaults to 1e-5.
- calculate_oob : bool, optional
If true, calculate the out-of-bag error. Defaults to True.
- random_state : int, optional
Specifies the seed for random number generator. 0: Uses the current time (in seconds) as the seed. Others: Uses the specified value as the seed. Defaults to 0.
- thread_ratio : float, optional
Controls the proportion of available threads to use. The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads. Values between 0 and 1 will use up to that percentage of available threads. Values outside this range tell PAL to heuristically determine the number of threads to use. Defaults to heuristically determined.
- allow_missing_dependent : bool, optional
- Specifies if a missing target value is allowed.
- False: Not allowed. An error occurs if a missing target is present.
- True: Allowed. The datum with a missing target is removed.
Defaults to True.
- categorical_variable : list of str, optional
Indicates which features should be treated as categorical. The default behavior is: string columns are categorical, while integer and float columns are continuous. This parameter is valid only for integer variables and is omitted otherwise. The default is detected from the input data.
- sample_fraction : float, optional
The fraction of data used for training. If there are n rows of data and the sample fraction is r, then n*r rows are selected for training. Defaults to 1.0.
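For illustration, if the training data has 1000 rows and sample_fraction is 0.75, roughly 750 rows are used for training. A minimal sketch, with the connection cc as in the examples below and the parameter values illustrative only:
>>> rfr = RandomForestRegressor(conn_context=cc, n_estimators=50,
...                             sample_fraction=0.75,  # train on about 75% of the rows
...                             calculate_oob=True)    # also report the out-of-bag error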
Examples
Input dataframe for training:
>>> df1.head(5).collect()
   ID         A         B         C         D       CLASS
0   0 -0.965679  1.142985 -0.019274 -1.598807  -23.633813
1   1  2.249528  1.459918  0.153440 -0.526423  212.532559
2   2 -0.631494  1.484386 -0.335236  0.354313   26.342585
3   3 -0.967266  1.131867 -0.684957 -1.397419  -62.563666
4   4 -1.175179 -0.253179 -0.775074  0.996815 -115.534935
Creating RandomForestRegressor instance:
>>> rfr = RandomForestRegressor(conn_context=cc, random_state=3)
Performing fit() on given dataframe:
>>> rfr.fit(df1, key='ID')
>>> rfr.feature_importances_.collect()
  VARIABLE_NAME  IMPORTANCE
0             A    0.249593
1             B    0.381879
2             C    0.291403
3             D    0.077125
Input dataframe for predicting:
>>> df2.collect()
   ID         A         B         C         D
0   0  1.081277  0.204114  1.220580 -0.750665
1   1  0.524813 -0.012192 -0.418597  2.946886
Performing predict() on given dataframe:
>>> result = rfr.predict(df2, key='ID')
>>> result.collect()
   ID    SCORE  CONFIDENCE
0   0   48.126   62.952884
1   1 -10.9017   73.461039
Input dataframe for scoring:
>>> df3.head(5).collect()
   ID         A         B         C         D      CLASS
0   0  1.081277  0.204114  1.220580 -0.750665  139.10170
1   1  0.524813 -0.012192 -0.418597  2.946886   52.17203
2   2 -0.280871  0.100554 -0.343715 -0.118843  -34.69829
3   3 -0.113992 -0.045573  0.957154  0.090350   51.93602
4   4  0.287476  1.266895  0.466325 -0.432323  106.63425
Performing score() on given dataframe:
>>> rfr.score(df3, key='ID')
0.6530768858159514
Attributes: - model_ : DataFrame
Trained model content.
- feature_importances_ : DataFrame
The feature importance (the higher, the more important the feature).
- oob_error_ : DataFrame
Out-of-bag error rate or mean squared error of the random forest, reported cumulatively up to the indexed tree. Set to None if calculate_oob is False.
Methods
- fit(data[, key, features, label]): Train the model on input data.
- predict(data, key[, features, block_size, …]): Predict dependent variable values based on fitted model.
- score(data, key[, features, label, …]): Returns the coefficient of determination R^2 of the prediction.
fit(data, key=None, features=None, label=None)¶
Train the model on input data.
Parameters: - data : DataFrame
Training data.
- key : str, optional
Name of the ID column. If key is not provided, it is assumed that the input has no ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
predict(data, key, features=None, block_size=None, missing_replacement=None)¶
Predict dependent variable values based on fitted model.
Parameters: - data : DataFrame
Independent variable values to predict for.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID columns.
- block_size : int, optional
The number of rows loaded per batch during prediction. 0 indicates loading all data at once. Defaults to 0.
- missing_replacement : str, optional
- The missing replacement strategy:
- ‘feature_marginalized’: marginalise each missing feature out independently.
- ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to feature_marginalized.
Returns: - DataFrame
- DataFrame of score and confidence, structured as follows:
- ID column, with same name and type as data’s ID column.
- SCORE, type DOUBLE, representing the predicted values.
- CONFIDENCE, all zeros; it is included because PAL uses the same output table for classification.
score(data, key, features=None, label=None, block_size=None, missing_replacement=None)¶
Returns the coefficient of determination R^2 of the prediction.
Parameters: - data : DataFrame
Data on which to assess model performance.
- key : str
Name of the ID column.
- features : list of str, optional
Names of the feature columns. If features is not provided, it defaults to all non-ID, non-label columns.
- label : str, optional
Name of the dependent variable. Defaults to the last column.
- block_size : int, optional
The number of rows loaded per batch during prediction. 0 indicates loading all data at once. Defaults to 0.
- missing_replacement : str, optional
- The missing replacement strategy:
- ‘feature_marginalized’: marginalise each missing feature out independently.
- ‘instance_marginalized’: marginalise all missing features in an instance as a whole corresponding to each category.
Defaults to feature_marginalized.
Returns: - float
The coefficient of determination R^2 of the prediction on the given data.
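As a worked illustration of the coefficient of determination (computed locally with pandas, not via hana_ml), R^2 = 1 - SS_res/SS_tot; the true values below are rounded from the scoring dataframe above and the predictions are hypothetical:
>>> import pandas as pd
>>> y_true = pd.Series([139.10, 52.17, -34.70, 51.94, 106.63])  # rounded CLASS values from df3 above
>>> y_pred = pd.Series([130.0, 60.0, -30.0, 55.0, 100.0])       # hypothetical predictions
>>> ss_res = ((y_true - y_pred) ** 2).sum()                     # residual sum of squares
>>> ss_tot = ((y_true - y_true.mean()) ** 2).sum()              # total sum of squares
>>> 1 - ss_res / ss_tot                                         # closer to 1 means better fit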