Survival Analysis

In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects who have survived or are saved after that intervention over a period of time. We wish to measure the survival probability of Dukes’C colorectal cancer patients after treatment and evaluate statistically whether the patients who accept treatment can survive longer than those who are only controlled conservatively.

1. Setup the Connection to SAP HANA

Create a jdbc or odbc connection called conn to access a SAP HANA. You could modified the parameters for the connection string in the following line, or read the information from a txt file (in our case, we store the connection information in config.R)

library(hana.ml.r)
# # jdbc connection
# > conn <- hanaml.ConnectionContext(dsn = <host>:<port>,
#                                    username = 'xxxx',
#                                    password = 'xxxx',
#                                    odbc = FALSE,
#                                    jdbcDriverPath = <path to jdbcDriver>, 
#                                    ...)
#                                    
# # OR odbc connection
# > conn <- hanaml.ConnectionContext(dsn = <ODBC data source name>,
#                                    username = 'xxxx',
#                                    password = 'xxxx',
#                                    odbc = TRUE,
#                                    ...)
source("config.R")

## WARN [2023-03-10 17:05:02] The HANA version is 4.50.000.00.1663300048 and note that some classes, functions and their parameters are different between HANA versions and please refer to the documentation of the specific function and class!

Connection status:

conn$connection

## <JDBCConnection>

2. Dataset

This scenarios describes a clinical trial of 49 patients for the treatment of Dukes’C colorectal cancer. The following data shows the survival time in 49 patients with Dukes’C colorectal cancer who are randomly assigned to either linoleic acid or control treatment.

The + sign indicates censored data. Until 6 months after treatment, there are no deaths. The effect of the censoring is to remove from the alive group those that are censored. At time 6 months two subjects have been censored so the number alive just before 6 months is 23. There are two deaths at 6 months. Thus,

We now reduce the number alive (“at risk”) by two. The censored event at 9 months reduces the “at risk” set to 20. At 10 months there are two deaths. So the proportion surviving is 18/20 = 0.9, and the cumulative proportion surviving is 0.913*0.90 = 0.8217.

3. Implementation Steps

Option 1: Kaplan-Meier Estimate

Technology Background

Kaplan-Meier estimate is one of the simplest way to measure the fraction of subjects living for a certain amount of time after treatment. The time starting from a defined point to the occurrence of a given event, for example death, is called as survival time.

Step 1

The trial data can then be loaded into table as follows:

sqlQueryMix(conn$connection, "DROP TABLE PAL_TRIAL_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_TRIAL_DATA_TBL (
        \"TIME\" INTEGER,
        \"STATUS\" INTEGER,
        \"OCCURRENCES\" INTEGER,
        \"GROUP\" VARCHAR(50));'
sqlQueryMix(conn$connection, sql)

data.list <- c(
  "1, 0, 1, 'linoleic acid'",
  "5, 0, 1, 'linoleic acid'",
  "6, 1, 1, 'linoleic acid'",
  "6, 1, 1, 'linoleic acid'",
  "9, 0, 1, 'linoleic acid'",
  "10, 1, 1, 'linoleic acid'",
  "10, 1, 1, 'linoleic acid'",
  "10, 0, 1, 'linoleic acid'",
  "12, 1, 4, 'linoleic acid'",
  "12, 0, 1, 'linoleic acid'",
  "13, 0, 1, 'linoleic acid'",
  "15, 0, 1, 'linoleic acid'",
  "16, 0, 1, 'linoleic acid'",
  "20, 0, 1, 'linoleic acid'",
  "24, 1, 1, 'linoleic acid'",
  "24, 0, 1, 'linoleic acid'",
  "27, 0, 1, 'linoleic acid'",
  "32, 1, 1, 'linoleic acid'",
  "34, 0, 1, 'linoleic acid'",
  "36, 0, 2, 'linoleic acid'",
  "44, 0, 1, 'linoleic acid'",
  "3, 0, 1, 'control'",
  "6, 1, 4, 'control'",
  "8, 1, 2, 'control'",
  "12, 1, 2, 'control'",
  "12, 0, 1, 'control'",
  "15, 0, 1, 'control'",
  "16, 0, 1, 'control'",
  "18, 0, 2, 'control'",
  "20, 1, 1, 'control'",
  "22, 0, 1, 'control'",
  "24, 1, 1, 'control'",
  "28, 0, 3, 'control'",
  "30, 0, 1, 'control'",
  "30, 1, 1, 'control'",
  "33, 0, 1, 'control'",
  "42, 1, 1, 'control'")

for (table.data in data.list) {
  sqlQueryMix(conn$connection, sprintf(
    "INSERT INTO %s VALUES (%s)", "PAL_TRIAL_DATA_TBL",
    paste(table.data, collapse = ", ")))
}

trial.df <- conn$table("PAL_TRIAL_DATA_TBL")

Look at the first 5 rows of trial.df:

print(trial.df$Head(5))

##   TIME STATUS OCCURRENCES         GROUP
## 1    1      0           1 linoleic acid
## 2    5      0           1 linoleic acid
## 3    6      1           1 linoleic acid
## 4    6      1           1 linoleic acid
## 5    9      0           1 linoleic acid

Step 2

Input customer data and use the Kaplan-Meier function to get the survival estimates and log-rank test statistics:

result <- hanaml.kaplan.meier.survival.analysis(trial.df)
print(result[[1]])

##            GROUP TIME RISK_NUMBER EVENT_NUMBER PROBABILITY         SE  CI_LOWER
## 1        control    6          23            4   0.8260870 0.07903420 0.6006104
## 2        control    8          19            2   0.7391304 0.09156054 0.5092094
## 3        control   12          17            2   0.6521739 0.09931135 0.4234786
## 4        control   20          10            1   0.5869565 0.10870510 0.3488771
## 5        control   24           8            1   0.5135870 0.11729213 0.2713181
## 6        control   30           4            1   0.3851902 0.14178453 0.1310408
## 7        control   42           1            1   0.0000000         NA        NA
## 8  linoleic acid    6          23            2   0.9130435 0.05875338 0.6949475
## 9  linoleic acid   10          20            2   0.8217391 0.08091666 0.5917337
## 10 linoleic acid   12          17            4   0.6283887 0.10476560 0.3911328
## 11 linoleic acid   24           8            1   0.5498402 0.11748198 0.2997946
## 12 linoleic acid   32           5            1   0.4398721 0.13604288 0.1794154
##     CI_UPPER
## 1  0.9309036
## 2  0.8733758
## 3  0.8084500
## 4  0.7636978
## 5  0.7115052
## 6  0.6389978
## 7         NA
## 8  0.9775157
## 9  0.9291698
## 10 0.7945786
## 11 0.7430605
## 12 0.6753069

To compare survival estimates produced from two groups, we use log-rank test. It is a hypothesis test to compare the survival distribution of two groups (some of the observations may be censored) and is used to test the null hypothesis that there is no difference between the populations (treatment group and control group) in the probability of an event (here a death) at any time point. The methods are nonparametric in that they do not make assumptions about the distributions of survival estimates. The analysis is based on the times of events (here deaths). For each such time we calculate the observed number of deaths in each group and the number expected if there were in reality no difference between the groups. It is widely used in clinical trials to establish the efficacy of a new treatment in comparison with a control treatment when the measurement is the time to event (such as the time from initial treatment to death).

Because the log-rank test is purely a test of significance, it cannot provide an estimate of the size of the difference between the groups.

print(result[[2]])

##           GROUP TOTAL_RISK OBSERVED EXPECTED LOGRANK_STAT
## 1       control         24       12 10.71468    0.3289494
## 2 linoleic acid         25       10 11.28532    0.3289494

print(result[[3]])

##   STAT_NAME STAT_VALUE
## 1    chiSqr  0.3289494
## 2        df  1.0000000
## 3   p-value  0.5662784

Option 2: Weibull Distribution

Technology Background

Weibull distribution is often used for reliability and survival analysis. It is defined by 3 parameters: shape, scale, and location. Scale works as key to magnify or shrink the curve. Shape is the crucial factor to define how the curve looks like, as described below:

Shape = 1: The failure rate is constant over time, indicating random failure.
Shape < 1: The failure rate decreases over time.
Shape > 1: The failure rate increases over time.

Step 1

Get Weibull distribution and statistics from the linoleic acid treatment data:

sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRFITCENSORED_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_DISTRFITCENSORED_DATA_TBL (\"LEFT\" DOUBLE, \"RIGHT\" DOUBLE);'
sqlQueryMix(conn$connection, sql)

data.list <- c("1,NULL",
              "5,NULL",
              "6,6",
              "6,6",
              "9,NULL",
              "10,10",
              "10,10",
              "10,NULL",
              "12,12",
              "12,12",
              "12,12",
              "12,12",
              "12,NULL",
              "13,NULL",
              "15,NULL",
              "16,NULL",
              "20,NULL",
              "24,24",
              "24,NULL",
              "27,NULL",
              "32,32",
              "34,NULL",
              "36,NULL",
              "36,NULL",
              "44,NULL")

for (table.data in data.list) {
  sqlQueryMix(conn$connection, sprintf(
    "INSERT INTO %s VALUES (%s)", "PAL_DISTRFITCENSORED_DATA_TBL",
    paste(table.data, collapse = ", ")))
}

linoleic.acid.df <- conn$table("PAL_DISTRFITCENSORED_DATA_TBL")

Look the first 5 rows of linoleic.acid.df:

print(linoleic.acid.df$Head(5))

##   LEFT RIGHT
## 1    1    NA
## 2    5    NA
## 3    6     6
## 4    6     6
## 5    9    NA

Call Weibull Distribution function and show the results:

result <- hanaml.DistributionFit(data = linoleic.acid.df, distr.type = "weibull",
                                optimal.method = "maximum.likelihood", censored=TRUE)
print(result)

## [[1]]
##               NAME   VALUE
## 1 DISTRIBUTIONNAME WEIBULL
## 2            SCALE 36.3069
## 3            SHAPE 1.40528
## 
## [[2]]
##       STAT_NAME STAT_VALUE
## 1 LOGLIKELIHOOD  -47.03583

Step 2

Get Weibull distribution and statistics from the control treatment data:

sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRFITCENSORED_DATA_TBL2")
sql <- 'CREATE COLUMN TABLE PAL_DISTRFITCENSORED_DATA_TBL2 (\"LEFT\" DOUBLE, \"RIGHT\" DOUBLE);'
sqlQueryMix(conn$connection, sql)

data.list <- c("3,NULL",
               "6,6",
               "6,6",
               "6,6",
               "6,6",
               "8,8",
               "8,8",
               "12,12",
               "12,12",
               "12,NULL",
               "15,NULL",
               "16,NULL",
               "18,NULL",
               "18,NULL",
               "20,20",
               "22,NULL",
               "24,24",
               "28,NULL",
               "28,NULL",
               "28,NULL",
               "30,30",
               "30,NULL",
               "33,NULL",
               "42,42")

for (table.data in data.list) {
  sqlQueryMix(conn$connection, sprintf(
    "INSERT INTO %s VALUES (%s)", "PAL_DISTRFITCENSORED_DATA_TBL2",
    paste(table.data, collapse = ", ")))
}

control.df <- conn$table("PAL_DISTRFITCENSORED_DATA_TBL2")
print(control.df)

Look the first 5 rows of control.df:

print(control.df$Head(5))

##   LEFT RIGHT
## 1    3    NA
## 2    6     6
## 3    6     6
## 4    6     6
## 5    6     6

Call Weibull Distribution function and show the results:

result <- hanaml.DistributionFit(data = control.df, distr.type = "weibull",
                                 optimal.method = "maximum.likelihood", censored=TRUE)

print(result)

## [[1]]
##               NAME   VALUE
## 1 DISTRIBUTIONNAME WEIBULL
## 2            SCALE 31.2021
## 3            SHAPE 1.43217
## 
## [[2]]
##       STAT_NAME STAT_VALUE
## 1 LOGLIKELIHOOD  -54.02222

Step 3

Get the CDF (cumulative distribution function) of Weibull distribution for the linoleic acid treatment data:

sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRPROB_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_DISTRPROB_DATA_TBL (\"DATACOL\" DOUBLE);'
sqlQueryMix(conn$connection, sql)

data.list <- c(6,8,12,20,24,30,42)

for (table.data in data.list) {
  sqlQueryMix(conn$connection, sprintf(
    "INSERT INTO %s VALUES (%s)", "PAL_DISTRPROB_DATA_TBL",
    paste(table.data, collapse = ", ")))
}

distr.prob.df <- conn$table("PAL_DISTRPROB_DATA_TBL")

Show the distr.prob.df:

print(distr.prob.df)

##   DATACOL
## 1       6
## 2       8
## 3      12
## 4      20
## 5      24
## 6      30
## 7      42

Invoke hanaml.CDF and show the result:

result <- hanaml.CDF(distr.prob.df, distr.info = list('Weibull', 1.40528, 36.3069), complementary=FALSE)
print(result)

##   DATACOL PROBABILITY
## 1       6  0.07657965
## 2       8  0.11251514
## 3      12  0.19024473
## 4      20  0.35118219
## 5      24  0.42818255
## 6      30  0.53457258
## 7      42  0.70687364

Step 4

Get the CDF (cumulative distribution function) of Weibull distribution for the control treatment data:

sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRPROB_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_DISTRPROB_DATA_TBL (\"DATACOL\" DOUBLE);'
sqlQueryMix(conn$connection, sql)

data.list <- c(6,10,12,24,32)

for (table.data in data.list) {
  sqlQueryMix(conn$connection, sprintf(
    "INSERT INTO %s VALUES (%s)", "PAL_DISTRPROB_DATA_TBL",
    paste(table.data, collapse = ", ")))
}

distr.prob.df <- conn$table("PAL_DISTRPROB_DATA_TBL")