
The Cox survival guide: hazards, models and time

Abstract

This note is an overview of survival analysis with a focus on the Cox proportional hazards model, a key tool for modelling time-to-event data. We define hazard functions, address right-censoring, and derive estimators like Nelson-Aalen and Kaplan-Meier. The Cox model's semiparametric form and partial likelihood estimation are discussed, along with evaluation metrics such as concordance probability.

What is hazard?

The cumulative hazard function of a time-to-event distribution (e.g. time to death in a medical trial) can be defined in full generality using the Lebesgue-Stieltjes integral, which accommodates continuous distributions, discrete distributions, and more.

For a survival time \( T \) with survival function \( S(t) = P(T > t) \), the cumulative hazard function \( H(t) \) is defined as:

\[ H(t) = \int_0^t \frac{dF(s)}{S(s^-)}, \]

where \( F(s) = P(T \leq s) \) and \( S(s^-) = \lim_{u \uparrow s} S(u) = P[T\ge s] \) is the left-continuous survival function. Here the integral is a Lebesgue-Stieltjes integral (TL;DR: it makes sense because \(F\) is monotone, hence of finite variation).

Continuous Distributions:

If \( T \) is a continuous random variable with density \(f\), then \( S(s^-) = S(s) \) and the cumulative hazard reduces to

\[ \begin{align} H(t) &= \int_0^t h(s) ds \notag\\ h(t) &= \frac{f(t)}{S(t)} \label{e-hazard-func} \end{align} \]

Notice that \(f(t)=-S'(t)\) so by \(\eqref{e-hazard-func}\) the survival function satisfies the ODE

\[h(t) = - S'(t)/S(t)\]

with boundary value \(S(0)=1\) (and \(S(\infty)=0\) whenever \(H(\infty)=\infty\)). One concludes that \( S(t) = \exp(-H(t)) \). Given an estimator of \(H\) or \(h\), one can estimate \(S\) through this relation.
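
As a sanity check, consider the exponential distribution with rate \(\lambda\), i.e. \(f(t) = \lambda e^{-\lambda t}\) and \(S(t) = e^{-\lambda t}\):

\[ h(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda, \qquad H(t) = \lambda t, \qquad S(t) = \exp(-H(t)) = e^{-\lambda t}. \]

The exponential distribution is exactly the constant-hazard case.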

Discrete Distributions:

If \( T \) is discrete with support \(\{ t_1, t_2, \dots \}\), the cumulative hazard is a sum over discrete hazards:

\[ H(t) = \sum_{t_j \leq t} h(t_j), \quad \text{where } h(t_j) = \frac{F(t_j)-F(t_j^-)}{P[T\ge t_j]} = P[T = t_j \mid T \geq t_j]. \]

We can and will assume that \(t_1<t_2<t_3<\cdots\). For any \(t\ge t_1\) there exists a unique \(j\) such that \(t_j\le t<t_{j+1}\), and for such \(t\) we have

\[ \begin{align*} S(t) &= P[T>t] = P[T>t_i, \forall i = 1, ..., j] \\ & = P[T\ge t_1] P[T>t_1|T\ge t_1] P[T>t_2 | T>t_1] ... P[T>t_j|T>t_{j-1}] \end{align*} \]

where we used the fact that \(P[T\ge t_1]=1\). Notice that \(\{T>t_{i-1}\}= \{T\ge t_i\}\), hence

\[ S(t) = \prod_{t_j \leq t} \left(1 - h(t_j)\right) \]

Note that \( H(t) \neq -\log S(t) \) here; instead, \( -\log S(t) = \sum_{t_j \leq t} -\log(1 - h(t_j)) \), which differs from \( H(t) \).

Again, given an estimator of \(h\) (note this \(h\) is different from the \(h\) in the continuous case), one can estimate \(S\) through this relation.
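
For a concrete check, take \(T\) geometric on \(\{1,2,\dots\}\) with success probability \(p\), i.e. \(P[T=j] = p(1-p)^{j-1}\). Then \(P[T\ge j] = (1-p)^{j-1}\), so

\[ h(j) = p, \qquad S(t) = (1-p)^{\lfloor t\rfloor}, \qquad H(t) = p\,\lfloor t\rfloor \neq -\log S(t) = -\lfloor t\rfloor \log(1-p), \]

illustrating the point above that \(H(t) \neq -\log S(t)\) in the discrete case.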

Other cases:

For mixed distributions with both continuous and discrete components, \( H(t) \) combines integrals over the continuous part with sums over the discrete jumps. But it can get more interesting: think of Cantor's function (the devil's staircase), which is continuous yet has no density. How can we express the hazard in that case?

Right-censoring

In medical applications, it is often necessary to incorporate a censoring distribution on top of the time-to-event distribution. Let \(C\) denote a nonnegative random variable and assume that the observation is \(Y = \min(T,C)\) and \(\Delta = I(T\le C)\). The indicator evaluates to 1 if the event happens no later than the censoring time (e.g. dead at time \(Y\)), and 0 otherwise (e.g. alive but left the trial at time \(Y\)). Typically one assumes independence between \(T\) and \(C\). What we really care about is the time-to-event distribution, which is not observed directly. How do we estimate the hazard (and hence, by the discussion above, the survival function) in this case?
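
To make the observation scheme concrete, here is a small simulation sketch (illustrative, not from the original note), assuming exponential event and censoring times:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
T = rng.exponential(scale=2.0, size=n)   # latent event times (unobserved in general)
C = rng.exponential(scale=3.0, size=n)   # independent censoring times
Y = np.minimum(T, C)                     # observed time
delta = (T <= C).astype(int)             # 1 = event observed, 0 = censored
print(delta.mean())                      # fraction uncensored, about 0.6 here
```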

It is a nice exercise to show that the independence assumption yields (see note1 for derivation)

\[ H_T(t) = \int_0^t \frac{dF_{Y,1}(u)}{S_Y(u^-)} \]

where \(H_T\) is the cumulative hazard function of \(T\) and \(F_{Y,1}(u)=P[Y\le u, \Delta=1]\). Now the whole thing is estimable by replacing each term with its empirical counterpart, which gives

\[ \hat H_T(t) = \sum_{i=1}^n \frac{\Delta_i I(Y_i\le t)}{\sum_{j} I(Y_j\ge Y_i)}. \]

Simple observations about the function \(\hat H_T\):

  1. piecewise constant
  2. right continuous
  3. it jumps and only jumps at \(T_i\) when \(\Delta_i=1\) (equivalently \(T_i=Y_i\))
  4. those \(T_i\) with \(\Delta_i=1\) need not be distinct; it is possible that \(T_i=T_j, \Delta_i=\Delta_j=1\) for some \(i\neq j\).

Let \(\{t_1, \dots, t_m\}\) be the set of distinct values of \(\{T_i: \Delta_i=1\}\), ordered increasingly. We conclude that \(\hat H_T\)

  • jumps at all \(t_i\)
  • with magnitude \(D_i/N_i\), where \(D_i = \sum_j I(T_j=t_i, \Delta_j = 1)\) is the number of observed events at time \(t_i\), and \(N_i=\sum_{j}I(Y_j\ge t_i)\) is the number of individuals at risk just before \(t_i\)

Summing these ratios gives the equivalent expression

\[ \hat H_T(t) = \sum_{i:t_i\le t} \frac{D_i}{N_i}. \]
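
For instance, with five observations \((Y_i,\Delta_i) = (2,1), (3,0), (4,1), (4,1), (5,0)\), the distinct event times are \(t_1=2\) (with \(D_1=1\), \(N_1=5\)) and \(t_2=4\) (with \(D_2=2\), \(N_2=3\)), so \(\hat H_T\) jumps by \(1/5\) at \(t=2\) and by \(2/3\) at \(t=4\).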

Let's recap: we estimate the hazard of \(T\) using the empirical distribution of \(Y\) and the empirical sub-distribution function \(F_{Y,1}\), then use the general relationship between hazard and survival function to get an estimate of the survival function of \(T\), so we come full circle!

We thus have two estimators for the survival function of \(T\):

\[ \begin{align*} \hat S_T^{\rm{NA}}(t) &= \exp( - \hat H_T(t)) \\ \hat S_T^{\rm{KM}}(t) &= \prod_{t_i\le t} \left(1 - \frac{D_i}{N_i}\right) \end{align*} \]

where KM stands for Kaplan-Meier and NA for Nelson-Aalen.
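
To make this concrete, here is a minimal numpy sketch (the function name and layout are my own) computing both estimators on the toy data from the worked example above:

```python
import numpy as np

def nelson_aalen_kaplan_meier(Y, delta):
    """Nelson-Aalen cumulative hazard and Kaplan-Meier survival,
    evaluated at the distinct observed event times."""
    Y = np.asarray(Y, dtype=float)
    delta = np.asarray(delta, dtype=bool)
    t = np.unique(Y[delta])                              # t_1 < ... < t_m
    D = np.array([np.sum((Y == s) & delta) for s in t])  # events at t_i
    N = np.array([np.sum(Y >= s) for s in t])            # at risk at t_i
    H = np.cumsum(D / N)                                 # \hat H_T(t_i)
    return t, H, np.exp(-H), np.cumprod(1 - D / N)

# toy data (Y_i, Delta_i) as in the worked example above
t, H, S_na, S_km = nelson_aalen_kaplan_meier(
    Y=[2, 3, 4, 4, 5], delta=[1, 0, 1, 1, 0])
print(t)     # [2. 4.]
print(H)     # [0.2    0.8667]
print(S_na)  # [0.8187 0.4204]
print(S_km)  # [0.8    0.2667]
```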

Cox proportional hazards model

So far we have discussed time-to-event data (censored or not) without covariate effects. The Cox PH model incorporates covariate effects into the modelling of the hazard.

Fun fact

Google Scholar reports 63k citations [as of 11 Feb 2025] for the 1972 paper by Cox 2 on his nowadays so-called "proportional hazards model". Some say it is the second most cited paper in science, but that is probably no longer true given that the deep learning boom has exploded publication counts: e.g. the transformer paper 3, published in 2017, now has 152k citations.

Anyhow, the Cox proportional hazards model remains the de facto standard in survival analysis. It assumes a semiparametric form for the hazard (assuming the hazard measure is absolutely continuous):

\[ h_{T|X}(t|x) = h_0(t) \exp(\beta\cdot x) \quad t>0, \, x\in \mathbb{R}^p. \]

The hazard is proportional under changes of the covariate \(x\): for each individual, \(\frac{h(t|x)}{h(t|x+\delta)}\) is constant over time. This is a strong assumption, yet quite flexible, because \(h_0\) is an arbitrary function and the model allows for (parametric) covariate effects.

Estimation of and hypothesis testing for \(\beta\) are obviously of utmost importance. This is done by maximizing the partial likelihood (the likelihood conditional on the occurrence times of the observed events; thanks to the PH assumption, only \(\beta\) appears in it). The exposition in some lecture notes4 is very clear so I will not waste tokens on it. See also the notes mentioned before1 for a slightly different perspective.
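
For reference, in the absence of ties the partial likelihood takes the standard form

\[ L(\beta) = \prod_{i:\,\Delta_i = 1} \frac{\exp(\beta\cdot x_i)}{\sum_{j:\, Y_j \ge Y_i} \exp(\beta\cdot x_j)}; \]

each factor is the conditional probability that, among the individuals at risk at time \(Y_i\), it is individual \(i\) who experiences the event, and the baseline hazard \(h_0\) cancels in every ratio. The following is a minimal sketch (my own, not the canonical implementation) that maximizes the log partial likelihood numerically on simulated data with true \(\beta = 1\):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, X, Y, delta):
    """Negative log partial likelihood of the Cox model (no ties assumed)."""
    eta = X @ beta                       # linear predictors beta . x_i
    nll = 0.0
    for i in np.flatnonzero(delta):      # sum over observed events only
        at_risk = Y >= Y[i]              # risk set at time Y_i
        nll -= eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return nll

# simulated data with one covariate and true beta = 1:
# h(t|x) = exp(x) corresponds to exponential T with scale exp(-x)
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
T = rng.exponential(scale=np.exp(-X[:, 0]))
C = rng.exponential(scale=2.0, size=n)
Y, delta = np.minimum(T, C), (T <= C).astype(int)

res = minimize(neg_log_partial_likelihood, x0=np.zeros(1), args=(X, Y, delta))
print(res.x)  # estimate of beta, close to 1
```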

Evaluating models

Having estimated \(\beta\) in the Cox PH model, we can rank the riskiness of individuals using the risk score \(g(x)= x\cdot\hat\beta\), which ranks individuals the same way the hazard does at any point in time, thanks to the PH assumption and the monotonicity of the exponential function.

A higher risk score translates to a shorter time to event, while a lower risk score translates to a longer time to event. A natural evaluation metric is the quality of this ranking. One widely used measure of this type is the concordance probability. The idea is to capture the probability that a pair of independent samples \((T_1,C_1,X_1), (T_2,C_2,X_2)\) are ordered correctly by the risk scores \(g(X_1)\) and \(g(X_2)\). To formulate this precisely, introduce for \(i\in \{1,2\}\) the events

\[ \begin{align*} E_i &= \{T_i<\max(T_1,T_2)\} \\ F_i &= \{g(X_i)>\min(g(X_1),g(X_2))\} \end{align*} \]

Clearly \(E_1\cup E_2 = \{T_1\neq T_2\}\). Moreover, \(E_1=\{T_1<T_2\}\) and \(E_2=\{T_2<T_1\}\) are disjoint, hence so are \(E_1\cap F_1\) and \(E_2\cap F_2\), which justifies the sums in the next display.

The concordance probability is

\[ \begin{equation} \label{e-cpop} P[ \cup_{i=1}^2 (E_i\cap F_i) | E_1\cup E_2 ] = \frac{ P[F_1 \cap E_1] + P[F_2 \cap E_2]}{P[E_1]+P[E_2]} \end{equation} \]

Note however that \(T_1,T_2\) are not always observable. To relate them to the observables \(Y\) and \(\Delta\) defined earlier, observe that

\[ \begin{align*} E_1 =& \{T_1<T_2\} \\ =&\{\Delta_1=1, \Delta_2=1, Y_1<Y_2\} \\ &\cup \{\Delta_1=1, \Delta_2=0, Y_1<T_2\} \\ &\cup \{\Delta_1=0, \Delta_2=1, T_1<Y_2\} \\ &\cup \{\Delta_1=0, \Delta_2=0, T_1<T_2\} \\ \supset& \{\Delta_1 = 1, Y_1<Y_2\} =:D_1 \end{align*} \]

and similarly \(E_2\supset \{\Delta_2 = 1, Y_1>Y_2\}=: D_2\).

The standard C-index of Harrell is not a consistent estimator of \(\eqref{e-cpop}\); rather, it estimates the analogous probability with \(D_i\) in place of \(E_i\). More concretely, it is the proportion of concordant pairs among all pairs whose smaller observation is an event time (\(\Delta=1\)), the so-called comparable pairs. One way to address the inconsistency is the CPE method5, which is robust with regard to how censoring is performed. Yet another proposal is the IPCW C-index6, a weighted version of the C-index that uses a consistent estimate of the survival function of the censoring distribution in the weights.
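
To fix ideas, here is a minimal \(O(n^2)\) sketch of Harrell's C-index as described above (the function name and the convention of counting risk-score ties as 1/2 are my own choices):

```python
import numpy as np

def harrell_c_index(Y, delta, g):
    """Harrell's C-index: fraction of comparable pairs ordered
    concordantly by the risk score g (higher g = shorter time)."""
    Y, delta, g = map(np.asarray, (Y, delta, g))
    num = den = 0.0
    n = len(Y)
    for i in range(n):
        if not delta[i]:
            continue                    # pair is comparable only if the
        for j in range(n):              # smaller time is an event time
            if Y[i] < Y[j]:
                den += 1
                if g[i] > g[j]:
                    num += 1            # concordant pair
                elif g[i] == g[j]:
                    num += 0.5          # tied risk scores count as 1/2
    return num / den

Y     = [2.0, 3.0, 4.0, 5.0]
delta = [1, 0, 1, 1]
g     = [1.0, 0.2, 0.5, 0.1]
print(harrell_c_index(Y, delta, g))  # 1.0 (perfectly concordant)
```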