Scenario-Survival-Analysis.Rmd
In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects who have survived or are saved after that intervention over a period of time. We wish to measure the survival probability of Dukes’C colorectal cancer patients after treatment and evaluate statistically whether the patients who accept treatment can survive longer than those who are only controlled conservatively.
Create a jdbc or odbc connection called conn to access a SAP HANA. You could modified the parameters for the connection string in the following line, or read the information from a txt file (in our case, we store the connection information in config.R)
library(hana.ml.r)
# # jdbc connection
# > conn <- hanaml.ConnectionContext(dsn = <host>:<port>,
# username = 'xxxx',
# password = 'xxxx',
# odbc = FALSE,
# jdbcDriverPath = <path to jdbcDriver>,
# ...)
#
# # OR odbc connection
# > conn <- hanaml.ConnectionContext(dsn = <ODBC data source name>,
# username = 'xxxx',
# password = 'xxxx',
# odbc = TRUE,
# ...)
source("config.R")
## WARN [2024-01-11 15:22:44] The HANA version is 4.50.000.00.1704032026 and note that some classes, functions and their parameters are different between HANA versions and please refer to the documentation of the specific function and class!
Connection status:
conn$connection
## <JDBCConnection>
This scenarios describes a clinical trial of 49 patients for the treatment of Dukes’C colorectal cancer. The following data shows the survival time in 49 patients with Dukes’C colorectal cancer who are randomly assigned to either linoleic acid or control treatment.
The + sign indicates censored data. Until 6 months after treatment, there are no deaths. The effect of the censoring is to remove from the alive group those that are censored. At time 6 months two subjects have been censored so the number alive just before 6 months is 23. There are two deaths at 6 months. Thus,
We now reduce the number alive (“at risk”) by two. The censored event at 9 months reduces the “at risk” set to 20. At 10 months there are two deaths. So the proportion surviving is 18/20 = 0.9, and the cumulative proportion surviving is 0.913*0.90 = 0.8217.
Technology Background
Kaplan-Meier estimate is one of the simplest way to measure the fraction of subjects living for a certain amount of time after treatment. The time starting from a defined point to the occurrence of a given event, for example death, is called as survival time.
Step 1
The trial data can then be loaded into table as follows:
sqlQueryMix(conn$connection, "DROP TABLE PAL_TRIAL_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_TRIAL_DATA_TBL (
\"TIME\" INTEGER,
\"STATUS\" INTEGER,
\"OCCURRENCES\" INTEGER,
\"GROUP\" VARCHAR(50));'
sqlQueryMix(conn$connection, sql)
data.list <- c(
"1, 0, 1, 'linoleic acid'",
"5, 0, 1, 'linoleic acid'",
"6, 1, 1, 'linoleic acid'",
"6, 1, 1, 'linoleic acid'",
"9, 0, 1, 'linoleic acid'",
"10, 1, 1, 'linoleic acid'",
"10, 1, 1, 'linoleic acid'",
"10, 0, 1, 'linoleic acid'",
"12, 1, 4, 'linoleic acid'",
"12, 0, 1, 'linoleic acid'",
"13, 0, 1, 'linoleic acid'",
"15, 0, 1, 'linoleic acid'",
"16, 0, 1, 'linoleic acid'",
"20, 0, 1, 'linoleic acid'",
"24, 1, 1, 'linoleic acid'",
"24, 0, 1, 'linoleic acid'",
"27, 0, 1, 'linoleic acid'",
"32, 1, 1, 'linoleic acid'",
"34, 0, 1, 'linoleic acid'",
"36, 0, 2, 'linoleic acid'",
"44, 0, 1, 'linoleic acid'",
"3, 0, 1, 'control'",
"6, 1, 4, 'control'",
"8, 1, 2, 'control'",
"12, 1, 2, 'control'",
"12, 0, 1, 'control'",
"15, 0, 1, 'control'",
"16, 0, 1, 'control'",
"18, 0, 2, 'control'",
"20, 1, 1, 'control'",
"22, 0, 1, 'control'",
"24, 1, 1, 'control'",
"28, 0, 3, 'control'",
"30, 0, 1, 'control'",
"30, 1, 1, 'control'",
"33, 0, 1, 'control'",
"42, 1, 1, 'control'")
for (table.data in data.list) {
sqlQueryMix(conn$connection, sprintf(
"INSERT INTO %s VALUES (%s)", "PAL_TRIAL_DATA_TBL",
paste(table.data, collapse = ", ")))
}
trial.df <- conn$table("PAL_TRIAL_DATA_TBL")
Look at the first 5 rows of trial.df:
print(trial.df$Head(5))
## TIME STATUS OCCURRENCES GROUP
## 1 1 0 1 linoleic acid
## 2 5 0 1 linoleic acid
## 3 6 1 1 linoleic acid
## 4 6 1 1 linoleic acid
## 5 9 0 1 linoleic acid
Step 2
Input customer data and use the Kaplan-Meier function to get the survival estimates and log-rank test statistics:
result <- hanaml.kaplan.meier.survival.analysis(trial.df)
print(result[[1]])
## GROUP TIME RISK_NUMBER EVENT_NUMBER PROBABILITY SE CI_LOWER
## 1 control 6 23 4 0.8260870 0.07903420 0.6006104
## 2 control 8 19 2 0.7391304 0.09156054 0.5092094
## 3 control 12 17 2 0.6521739 0.09931135 0.4234786
## 4 control 20 10 1 0.5869565 0.10870510 0.3488771
## 5 control 24 8 1 0.5135870 0.11729213 0.2713181
## 6 control 30 4 1 0.3851902 0.14178453 0.1310408
## 7 control 42 1 1 0.0000000 NA NA
## 8 linoleic acid 6 23 2 0.9130435 0.05875338 0.6949475
## 9 linoleic acid 10 20 2 0.8217391 0.08091666 0.5917337
## 10 linoleic acid 12 17 4 0.6283887 0.10476560 0.3911328
## 11 linoleic acid 24 8 1 0.5498402 0.11748198 0.2997946
## 12 linoleic acid 32 5 1 0.4398721 0.13604288 0.1794154
## CI_UPPER
## 1 0.9309036
## 2 0.8733758
## 3 0.8084500
## 4 0.7636978
## 5 0.7115052
## 6 0.6389978
## 7 NA
## 8 0.9775157
## 9 0.9291698
## 10 0.7945786
## 11 0.7430605
## 12 0.6753069
To compare survival estimates produced from two groups, we use log-rank test. It is a hypothesis test to compare the survival distribution of two groups (some of the observations may be censored) and is used to test the null hypothesis that there is no difference between the populations (treatment group and control group) in the probability of an event (here a death) at any time point. The methods are nonparametric in that they do not make assumptions about the distributions of survival estimates. The analysis is based on the times of events (here deaths). For each such time we calculate the observed number of deaths in each group and the number expected if there were in reality no difference between the groups. It is widely used in clinical trials to establish the efficacy of a new treatment in comparison with a control treatment when the measurement is the time to event (such as the time from initial treatment to death).
Because the log-rank test is purely a test of significance, it cannot provide an estimate of the size of the difference between the groups.
print(result[[2]])
## GROUP TOTAL_RISK OBSERVED EXPECTED LOGRANK_STAT
## 1 control 24 12 10.71468 0.3289494
## 2 linoleic acid 25 10 11.28532 0.3289494
print(result[[3]])
## STAT_NAME STAT_VALUE
## 1 chiSqr 0.3289494
## 2 df 1.0000000
## 3 p-value 0.5662784
Technology Background
Weibull distribution is often used for reliability and survival analysis. It is defined by 3 parameters: shape, scale, and location. Scale works as key to magnify or shrink the curve. Shape is the crucial factor to define how the curve looks like, as described below:
Shape = 1: The failure rate is constant over time, indicating random failure.
Shape < 1: The failure rate decreases over time.
Shape > 1: The failure rate increases over time.
Step 1
Get Weibull distribution and statistics from the linoleic acid treatment data:
sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRFITCENSORED_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_DISTRFITCENSORED_DATA_TBL (\"LEFT\" DOUBLE, \"RIGHT\" DOUBLE);'
sqlQueryMix(conn$connection, sql)
data.list <- c("1,NULL",
"5,NULL",
"6,6",
"6,6",
"9,NULL",
"10,10",
"10,10",
"10,NULL",
"12,12",
"12,12",
"12,12",
"12,12",
"12,NULL",
"13,NULL",
"15,NULL",
"16,NULL",
"20,NULL",
"24,24",
"24,NULL",
"27,NULL",
"32,32",
"34,NULL",
"36,NULL",
"36,NULL",
"44,NULL")
for (table.data in data.list) {
sqlQueryMix(conn$connection, sprintf(
"INSERT INTO %s VALUES (%s)", "PAL_DISTRFITCENSORED_DATA_TBL",
paste(table.data, collapse = ", ")))
}
linoleic.acid.df <- conn$table("PAL_DISTRFITCENSORED_DATA_TBL")
Look the first 5 rows of linoleic.acid.df:
print(linoleic.acid.df$Head(5))
## LEFT RIGHT
## 1 1 NA
## 2 5 NA
## 3 6 6
## 4 6 6
## 5 9 NA
Call Weibull Distribution function and show the results:
result <- hanaml.DistributionFit(data = linoleic.acid.df, distr.type = "weibull",
optimal.method = "maximum.likelihood", censored=TRUE)
print(result)
## [[1]]
## NAME VALUE
## 1 DISTRIBUTIONNAME WEIBULL
## 2 SCALE 36.3069
## 3 SHAPE 1.40528
##
## [[2]]
## STAT_NAME STAT_VALUE
## 1 LOGLIKELIHOOD -47.03583
Step 2
Get Weibull distribution and statistics from the control treatment data:
sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRFITCENSORED_DATA_TBL2")
sql <- 'CREATE COLUMN TABLE PAL_DISTRFITCENSORED_DATA_TBL2 (\"LEFT\" DOUBLE, \"RIGHT\" DOUBLE);'
sqlQueryMix(conn$connection, sql)
data.list <- c("3,NULL",
"6,6",
"6,6",
"6,6",
"6,6",
"8,8",
"8,8",
"12,12",
"12,12",
"12,NULL",
"15,NULL",
"16,NULL",
"18,NULL",
"18,NULL",
"20,20",
"22,NULL",
"24,24",
"28,NULL",
"28,NULL",
"28,NULL",
"30,30",
"30,NULL",
"33,NULL",
"42,42")
for (table.data in data.list) {
sqlQueryMix(conn$connection, sprintf(
"INSERT INTO %s VALUES (%s)", "PAL_DISTRFITCENSORED_DATA_TBL2",
paste(table.data, collapse = ", ")))
}
control.df <- conn$table("PAL_DISTRFITCENSORED_DATA_TBL2")
print(control.df)
Look the first 5 rows of control.df:
print(control.df$Head(5))
## LEFT RIGHT
## 1 3 NA
## 2 6 6
## 3 6 6
## 4 6 6
## 5 6 6
Call Weibull Distribution function and show the results:
result <- hanaml.DistributionFit(data = control.df, distr.type = "weibull",
optimal.method = "maximum.likelihood", censored=TRUE)
print(result)
## [[1]]
## NAME VALUE
## 1 DISTRIBUTIONNAME WEIBULL
## 2 SCALE 31.2021
## 3 SHAPE 1.43217
##
## [[2]]
## STAT_NAME STAT_VALUE
## 1 LOGLIKELIHOOD -54.02222
Step 3
Get the CDF (cumulative distribution function) of Weibull distribution for the linoleic acid treatment data:
sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRPROB_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_DISTRPROB_DATA_TBL (\"DATACOL\" DOUBLE);'
sqlQueryMix(conn$connection, sql)
data.list <- c(6,8,12,20,24,30,42)
for (table.data in data.list) {
sqlQueryMix(conn$connection, sprintf(
"INSERT INTO %s VALUES (%s)", "PAL_DISTRPROB_DATA_TBL",
paste(table.data, collapse = ", ")))
}
distr.prob.df <- conn$table("PAL_DISTRPROB_DATA_TBL")
Show the distr.prob.df:
print(distr.prob.df)
## DATACOL
## 1 6
## 2 8
## 3 12
## 4 20
## 5 24
## 6 30
## 7 42
Invoke hanaml.CDF and show the result:
result <- hanaml.CDF(distr.prob.df, distr.info = list('Weibull', 1.40528, 36.3069), complementary=FALSE)
print(result)
## DATACOL PROBABILITY
## 1 6 0.07657965
## 2 8 0.11251514
## 3 12 0.19024473
## 4 20 0.35118219
## 5 24 0.42818255
## 6 30 0.53457258
## 7 42 0.70687364
Step 4
Get the CDF (cumulative distribution function) of Weibull distribution for the control treatment data:
sqlQueryMix(conn$connection, "DROP TABLE PAL_DISTRPROB_DATA_TBL")
sql <- 'CREATE COLUMN TABLE PAL_DISTRPROB_DATA_TBL (\"DATACOL\" DOUBLE);'
sqlQueryMix(conn$connection, sql)
data.list <- c(6,10,12,24,32)
for (table.data in data.list) {
sqlQueryMix(conn$connection, sprintf(
"INSERT INTO %s VALUES (%s)", "PAL_DISTRPROB_DATA_TBL",
paste(table.data, collapse = ", ")))
}
distr.prob.df <- conn$table("PAL_DISTRPROB_DATA_TBL")
Show the distr.prob.df:
print(distr.prob.df)
## DATACOL
## 1 6
## 2 10
## 3 12
## 4 24
## 5 32
Invoke hanaml.CDF and show the result:
result <- hanaml.CDF(distr.prob.df, distr.info = list('Weibull', 1.71902, 20.444), complementary=FALSE)
print(result)
## DATACOL PROBABILITY
## 1 6 0.1144566
## 2 10 0.2536074
## 3 12 0.3297943
## 4 24 0.7321726
## 5 32 0.8846979
tbls <- list("PAL_TRIAL_DATA_TBL",
"PAL_DISTRFITCENSORED_DATA_TBL",
"PAL_DISTRFITCENSORED_DATA_TBL2",
"PAL_DISTRPROB_DATA_TBL")
for (tbl in tbls) {
query <- sprintf("DROP TABLE %s", tbl)
sqlQueryMix(conn$connection, query)
}
conn$close()