CoxProportionalHazardModel

class hana_ml.algorithms.pal.regression.CoxProportionalHazardModel(tie_method=None, status_col=None, max_iter=None, convergence_criterion=None, significance_level=None, calculate_hazard=None, output_fitted=None, type_kind=None, thread_ratio=0.0)

Cox proportional hazard model (CoxPHM) is a special generalized linear model. It is a well-known survival model that describes the risk of an event, such as failure or death, occurring at a given time.

Parameters
tie_method{'breslow', 'efron'}, optional

The method to deal with tied events.

Defaults to 'efron'.

status_colbool, optional(deprecated)

Specifies whether a status column is defined for right-censored data:

  • False : No status column. All response times are failure/death.

  • True : There is a status column, of which 0 indicates right-censored data, and 1 indicates failure/death.

Defaults to True.

Deprecated; use the status_col parameter of the fit() method instead.

max_iterint, optional

Maximum number of iterations for numeric optimization.

convergence_criterionfloat, optional

Convergence criterion of coefficients for numeric optimization.

Defaults to 0.

significance_levelfloat, optional

Significance level for the confidence interval of estimated coefficients.

Defaults to 0.05.

calculate_hazardbool, optional

Controls whether to calculate hazard function as well as survival function.

  • False : Does not calculate hazard function.

  • True: Calculates hazard function.

Defaults to True.

output_fittedbool, optional

Controls whether to output the fitted response:

  • False : Does not output the fitted response.

  • True: Outputs the fitted response.

Defaults to False.

type_kindstr, optional(deprecated)

The prediction type:

  • 'risk': Predicts in risk space

  • 'lp': Predicts in linear predictor space

Defaults to 'risk'.

Deprecated; use the pred_type parameter of the predict() method instead.

thread_ratiofloat, optional

Controls the proportion of available threads to use for fitting.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside that range tell PAL to heuristically determine the number of threads to use.

Does not affect fitting.

Defaults to 0.
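The two tie_method options differ in how the partial likelihood treats several events that share the same time. The following is a minimal pure-Python sketch of the textbook Breslow and Efron approximations for a single tied event time; it is not PAL's implementation, and the function name tied_log_likelihood and the sample linear predictors are hypothetical.

```python
import math


def tied_log_likelihood(event_lps, risk_set_lps, method="efron"):
    """Log partial-likelihood contribution of one event time that has ties.

    event_lps:    linear predictors of the subjects that fail at this time.
    risk_set_lps: linear predictors of every subject still at risk
                  (this includes the failing subjects).
    """
    d = len(event_lps)
    risk_sum = sum(math.exp(lp) for lp in risk_set_lps)
    event_sum = sum(math.exp(lp) for lp in event_lps)
    numerator = sum(event_lps)
    if method == "breslow":
        # Every tied event sees the full risk set.
        return numerator - d * math.log(risk_sum)
    if method == "efron":
        # The k-th tied event sees the risk set with k/d of the tied
        # subjects' weight removed.
        return numerator - sum(
            math.log(risk_sum - (k / d) * event_sum) for k in range(d)
        )
    raise ValueError("method must be 'breslow' or 'efron'")


# Hypothetical values: two subjects fail together, four are at risk.
events = [0.2, -0.1]
at_risk = [0.2, -0.1, 0.5, 0.0]
breslow = tied_log_likelihood(events, at_risk, "breslow")
efron = tied_log_likelihood(events, at_risk, "efron")
```

With no ties (a single event per time point) the two expressions coincide; they differ only when several events share a time.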

Examples

>>> df1.collect()
    ID  TIME    STATUS  X1  X2
0    1     4         1   0   0
1    2     3         1   2   0
2    3     1         1   1   0
3    4     1         0   1   0
4    5     2         1   1   1
5    6     2         1   0   1
6    7     3         0   0   1

Training the model:

>>> from hana_ml.algorithms.pal.regression import CoxProportionalHazardModel
>>> cox = CoxProportionalHazardModel(
...     significance_level=0.05, calculate_hazard=True)
>>> cox.fit(data=df1, key='ID', features=['STATUS', 'X1', 'X2'], label='TIME')

Prediction:

>>> df2.collect()
    ID  X1  X2
0    1   0   0
1    2   2   0
2    3   1   0
3    4   1   0
4    5   1   1
5    6   0   1
6    7   0   1
>>> cox.predict(data=df2, key='ID', features=['X1', 'X2']).collect()
    ID    PREDICTION           SE     CI_LOWER     CI_UPPER
0    1   0.383590423  0.412526262  0.046607574  3.157032199
1    2   1.829758442  1.385833778  0.414672719  8.073875617
2    3   0.837781484  0.400894077   0.32795551  2.140161678
3    4   0.837781484  0.400894077   0.32795551  2.140161678

Attributes
statistics_DataFrame

Regression-related statistics, such as R-squared, log-likelihood, and AIC.

coefficient_DataFrame

Fitted regression coefficients.

covariance_varianceDataFrame

Covariance-related data.

hazard_DataFrame

Statistics related to time, hazard, and survival.

fitted_DataFrame

Predicted dependent variable values for training data. Set to None if the training data has no row IDs.

Methods

fit(data[, key, features, label, status_col])

Fit regression model based on training data.

predict(data[, key, features, thread_ratio, ...])

Predict dependent variable values based on fitted model.

score(data[, key, features, label])

Returns the coefficient of determination R^2 of the prediction.

property fit_hdbprocedure

Returns the generated hdbprocedure for fit.

property predict_hdbprocedure

Returns the generated hdbprocedure for predict.

fit(data, key=None, features=None, label=None, status_col=None)

Fit regression model based on training data.

Parameters
dataDataFrame

Training data.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns (including the covariates as well as the status column).

If not provided, defaults to all non-key, non-label columns.

labelstr, optional

Name of the dependent variable (the time before a failure/death event occurs or the data is right-censored).

Defaults to the last non-ID column. (This is not the PAL default.)

status_colbool, optional

Specifies if a status column is defined for right-censored data.

False : No status column. All response times are failure/death.

True : There is a status column in data, of which 0 indicates right-censored data and 1 indicates failure/death. The column should correspond to:

  • the 1st column in features if features is specified;

  • the 1st non-key, non-label column in data if features is not specified.

Defaults to True.
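Because the status column is identified by position when status_col is True, it can help to build the features list explicitly. The following is a small sketch of such a helper; order_features is a hypothetical name and not part of hana_ml.

```python
def order_features(status_column, covariates):
    """Return a features list with the status column in first position.

    The status column is dropped from `covariates` if it appears there,
    so it is never listed twice.
    """
    rest = [name for name in covariates if name != status_column]
    return [status_column] + rest


# Column names taken from the example table in this section.
features = order_features("STATUS", ["X1", "X2"])
```

The resulting list could then be passed as the features argument of fit(), together with status_col=True.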

Returns
Fitted object.

predict(data, key=None, features=None, thread_ratio=None, pred_type=None, significance_level=None)

Predict dependent variable values based on fitted model.

Parameters
dataDataFrame

Independent variable values used for prediction.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the covariates.

thread_ratiofloat, optional(deprecated)

Controls the proportion of available threads to use for prediction.

The value range is from 0 to 1, where 0 indicates a single thread, and 1 indicates up to all available threads.

Values between 0 and 1 will use that percentage of available threads.

Values outside that range tell PAL to heuristically determine the number of threads to use.

Does not affect fitting.

Defaults to 0.

Deprecated and ineffective.

pred_type: str, optional

The prediction type:

  • 'risk': Predicts in risk space

  • 'lp': Predicts in linear predictor space

Defaults to 'risk'.

significance_levelfloat, optional

Significance level for the confidence interval and prediction interval.

Defaults to 0.05.
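The two pred_type spaces are related in the usual Cox-model way: the linear predictor is a weighted sum of the covariates, and the risk is its exponential. The sketch below illustrates that textbook relationship with made-up coefficients. Whether PAL centers or otherwise offsets the covariates before forming the linear predictor is not stated here, so the values are not expected to match PAL output.

```python
import math


def linear_predictor(coefficients, covariates):
    """'lp' space: the weighted sum of the covariates."""
    return sum(b * x for b, x in zip(coefficients, covariates))


def relative_risk(coefficients, covariates):
    """'risk' space: the exponential of the linear predictor."""
    return math.exp(linear_predictor(coefficients, covariates))


# Hypothetical coefficients and covariate values, for illustration only.
beta = [0.5, -0.25]
x = [2.0, 1.0]
lp = linear_predictor(beta, x)
risk = relative_risk(beta, x)
```

A linear predictor of 0 corresponds to a risk of 1, and each unit increase in the linear predictor multiplies the risk by e.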

Returns
DataFrame

Predicted values, structured as follows:

  • ID column, with the same name and type as data's ID column.

  • PREDICTION, representing predicted values, followed by the SE, CI_LOWER and CI_UPPER columns, as shown in the example output above.
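The sample output in the Examples section suggests how significance_level relates the interval columns to the prediction and its standard error. The sketch below reproduces the first sample row under the assumption that the interval is built on the log scale, with the standard error divided by the prediction (a delta-method step). This relationship is inferred from the sample numbers only; it is not a documented PAL formula, and risk_confidence_interval is a hypothetical helper.

```python
import math
from statistics import NormalDist


def risk_confidence_interval(prediction, se, significance_level=0.05):
    """Two-sided interval for a risk-space prediction (assumed form)."""
    # Standard normal quantile for the requested significance level.
    z = NormalDist().inv_cdf(1.0 - significance_level / 2.0)
    # Move the standard error from risk space to the log scale.
    half_width = z * se / prediction
    return (
        prediction * math.exp(-half_width),
        prediction * math.exp(half_width),
    )


# PREDICTION and SE of the first row of the sample output.
lower, upper = risk_confidence_interval(0.383590423, 0.412526262)
```

Under this assumption, a smaller significance_level gives a wider interval around the same prediction.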

score(data, key=None, features=None, label=None)

Returns the coefficient of determination R^2 of the prediction.

Parameters
dataDataFrame

Data on which to assess model performance.

keystr, optional

Name of the ID column.

Mandatory if data is not indexed, or the index of data contains multiple columns.

Defaults to the single index column of data if not provided.

featureslist of str, optional

Names of the feature columns.

labelstr, optional

Name of the dependent variable.

Defaults to the last non-ID column. (This is not the PAL default.)

Returns
float

The coefficient of determination R^2 of the prediction on the given data.
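For reference, the textbook coefficient of determination can be sketched as below. How PAL defines this quantity for a Cox model is not specified in this section, so the sketch shows only the general definition; r_squared is a hypothetical helper, not a hana_ml function.

```python
def r_squared(observed, predicted):
    """Textbook R^2: 1 - (residual sum of squares / total sum of squares)."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_observed) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A value of 1 means the predictions match the observations exactly, while 0 means they do no better than always predicting the observed mean.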

Inherited Methods from PALBase

Besides the methods described above, the CoxProportionalHazardModel class also inherits methods from the PALBase class. Refer to PAL Base for more details.