In this end-to-end scenario, We shown how to predict the segmentation/cluster of new customers for a supermarket. Firstly, we use the K-means function to cluster existing customers of the supermarket. Then, the output can be used as the training data for the C4.5 Decision Tree function to predict new customers’ segmentation/cluster.

1. Technology Background

  1. K-means clustering is a method of cluster analysis whereby the algorithm partitions N observations or records into K clusters, in which each observation belongs to the cluster with the nearest center. It is one of the most commonly used algorithms in clustering method.

  2. Decision trees are powerful and popular tools for classification and prediction. Decision tree learning, used in statistics, data mining, and machine learning uses a decision tree as a predictive model which maps the observations about an item to the conclusions about the item’s target value.

2. Implementation Steps

Setup the Connection to SAP HANA

Create a jdbc or odbc connection called conn to access a SAP HANA. You could modified the parameters for the connection string in the following line, or read the information from a txt file (in our case, we store the connection information in config.R)

library(hana.ml.r)
# # jdbc connection
# > conn <- hanaml.ConnectionContext(dsn = <host>:<port>,
#                                    username = 'xxxx',
#                                    password = 'xxxx',
#                                    odbc = FALSE,
#                                    jdbcDriverPath = <path to jdbcDriver>, 
#                                    ...)
#                                    
# # OR odbc connection
# > conn <- hanaml.ConnectionContext(dsn = <ODBC data source name>,
#                                    username = 'xxxx',
#                                    password = 'xxxx',
#                                    odbc = TRUE,
#                                    ...)
source("config.R")
## WARN [2023-03-10 17:04:15] The HANA version is 4.50.000.00.1663300048 and note that some classes, functions and their parameters are different between HANA versions and please refer to the documentation of the specific function and class!

Connection status:

conn$connection
## <JDBCConnection>

Step 1: Invoke K-means

Input customer data and use the K-means function to partition the data set into K clusters. In this example, nine rows of data will be input. K equals 3, which means the customers will be partitioned into three levels.

Generate the customer dataset with ID, AGE, INCOME:

age <- c(20, 21, 22, 30, 31, 32, 40, 41, 42)
income <- c(100000, 101000, 102000, 200000, 201000, 202000, 400000, 401000, 402000)
data <- data.frame(ID = seq(1,9), AGE = age, INCOME = income)
data.df <- ConvertToHANADataFrame(conn, 
                                  data = data,
                                  table = "PAL_KMEANS",
                                  force = TRUE,
                                  native = TRUE)
print(data.df)
##   ID AGE INCOME
## 1  1  20 100000
## 2  2  21 101000
## 3  3  22 102000
## 4  4  30 200000
## 5  5  31 201000
## 6  6  32 202000
## 7  7  40 400000
## 8  8  41 401000
## 9  9  42 402000

Call hanaml.Kmeans to cluster the customers:

km <- hanaml.KMeans(data.df,
                    key="ID",
                    n.clusters = 3, 
                    init = "first_k",
                    distance.level = 'euclidean',
                    max.iter = 100,
                    tol = 0.000001,
                    normalization = 'no')

print(km$labels)
##   ID CLUSTER_ID DISTANCE SLIGHT_SILHOUETTE
## 1  1          0     1000         0.9900990
## 2  2          0        0         1.0000000
## 3  3          0     1000         0.9898990
## 4  4          1     1000         0.9898990
## 5  5          1        0         1.0000000
## 6  6          1     1000         0.9900990
## 7  7          2     1000         0.9949749
## 8  8          2        0         1.0000000
## 9  9          2     1000         0.9950249

Join the data.df with the result column of km$labels:

result <- km$labels$Select(list("ID", "CLUSTER_ID"))$rename.columns(list("ID_R", "CLUSTER_ID"))

# join the column CLUSTER_ID with the original data.df
data.cluster <- data.df$Join(result, on.expression="ID = ID_R")

# select the necessary columns for modeling in the next step
data.cluster <- data.cluster$Select(list('AGE', 'INCOME', 'CLUSTER_ID'))

# rename the  column 'CLUSTER_ID' to be 'LEVEL'
data.cluster <- data.cluster$rename.columns(list('AGE', 'INCOME', 'LEVEL'))
print(data.cluster)
##   AGE INCOME LEVEL
## 1  20 100000     0
## 2  21 101000     0
## 3  22 102000     0
## 4  30 200000     1
## 5  31 201000     1
## 6  32 202000     1
## 7  40 400000     2
## 8  41 401000     2
## 9  42 402000     2

Step 2: Invoke C4.5 Decision Tree

Use the above output data.cluster as the training data of C4.5 Decision Tree. The C4.5 Decision Tree function will generate a tree model which maps the observations about an item to the conclusions about the item’s target value.

# convert data type of column LEVEL to be VARCHAR(10)
data.cluster <- data.cluster$cast('LEVEL', 'VARCHAR (10)')

dt <- hanaml.DecisionTreeClassifier(data.cluster,
                                    algorithm = 'c45',
                                    percentage = 1.0,
                                    model.format = 'pmml')

# have a look at the result decision.rules
print(dt$decision.rules)
##   ROW_INDEX    RULES_CONTENT
## 1         0    (AGE<26) => 0
## 2         1 26=<AGE<36) => 1
## 3         2   (AGE>=36) => 2

Step 3 : Prediction with Tree Model

Use the above tree model to map each new customer to the corresponding level he or she belongs to:

# generate the new customer dataset
new.data <- data.frame(ID=c(10, 11, 12), AGE= c(20, 30, 40), INCOME= c(100003, 200003, 400003))
new.data.df <- ConvertToHANADataFrame(conn, 
                                      data = new.data,
                                      table = "PAL_PREDICT_TBL",
                                      force = TRUE,
                                      native = TRUE)

# convert data type of column ID to be INTERGER
new.data.df <- new.data.df$cast('ID', 'INTEGER')

# predict with decision tree object 'dt'
result <- predict(dt, new.data.df, key='ID')
print(result)
##   ID SCORE CONFIDENCE
## 1 10     0          1
## 2 11     1          1
## 3 12     2          1

3. Clear Data and Close the Connection

tbls <- list("PAL_KMEANS", "PAL_PREDICT_TBL")
for (tbl in tbls) {
  query <- sprintf("DROP TABLE %s", tbl)
  sqlQueryMix(conn$connection, query)
}
conn$close()