Survival Analysis: Time-to-Event Data, Kaplan-Meier Curves, and Cox Proportional Hazards Regression
A practical guide to survival analysis covering censored data, Kaplan-Meier estimation, log-rank tests, and Cox proportional hazards regression: the core toolkit for analyzing how long it takes for events to occur.
What You'll Learn
- Explain why standard regression methods fail for time-to-event data and what makes survival analysis necessary
- Construct and interpret a Kaplan-Meier survival curve including median survival time and confidence intervals
- Perform and interpret a log-rank test to compare survival between groups
- Fit a Cox proportional hazards model and interpret hazard ratios correctly
1. Why Survival Analysis Exists and When You Need It
Survival analysis answers one question: how long until something happens? The "something" doesn't have to be death. It could be time until a machine breaks, time until a customer cancels a subscription, time until a patient relapses, or time until a student drops out of a program. What unites these problems is that they all measure duration until an event, and that's where standard methods fall apart.

Here's the core problem. You run a 5-year clinical trial with 200 patients. At the end of the study, 120 patients have died and 80 are still alive. If you just calculate the proportion who died (120/200 = 60%), you ignore the fact that some patients died at month 3 and others at month 58. Timing matters. And those 80 surviving patients? You know they survived at least 5 years, but you don't know when they'll eventually die. Their data is incomplete: they're what we call censored observations.

Censoring is the defining feature of survival data, and it's the reason you can't just use linear regression or logistic regression. A censored observation tells you the event hadn't occurred yet by a certain time, but doesn't tell you when it will occur. Right-censoring (the most common type) happens when the study ends before the event occurs, the subject drops out, or the subject is lost to follow-up.

If you exclude censored observations, you bias results toward shorter survival times, because you're only analyzing people who died quickly enough to be captured in your study window. If you treat censored observations as if the event occurred at the censoring time, you also underestimate survival. Neither approach works. Survival analysis handles censoring correctly by using the partial information from censored observations: we know the patient survived at least this long, even if we don't know the exact event time. Ordinary linear and logistic regression have no built-in way to use that partial information.
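To see the two mistakes in numbers, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the exponential event times, the true mean of 4 years, and the sample size are hypothetical and do not come from the trial described above. Only the 5-year follow-up window mirrors the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 4.0      # hypothetical true mean survival time, in years
STUDY_LENGTH = 5.0   # follow-up ends here, as in the 5-year trial example

# Simulate true event times, then apply right-censoring at the end of the study.
true_times = [random.expovariate(1 / TRUE_MEAN) for _ in range(10_000)]
observed = [(min(t, STUDY_LENGTH), t <= STUDY_LENGTH) for t in true_times]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Mistake 1: drop censored subjects and average only the observed events.
mean_excluding = mean(time for time, event in observed if event)

# Mistake 2: pretend every censored subject had the event at the censoring time.
mean_as_events = mean(time for time, _ in observed)

# What we would see with unlimited follow-up (never available in a real study).
true_mean = mean(true_times)

print(f"excluding censored:        {mean_excluding:.2f} years")
print(f"censored treated as event: {mean_as_events:.2f} years")
print(f"true mean (no censoring):  {true_mean:.2f} years")
```

Both shortcuts land below the true mean, which is the bias toward shorter survival described above.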
Key Points
- Survival analysis handles time-to-event data where some observations are incomplete (censored)
- Right-censoring occurs when the study ends, or subjects drop out, before the event happens, so their true event time is unknown
- Excluding or mishandling censored data introduces systematic bias toward shorter survival times
- Standard regression (linear, logistic) cannot properly account for censoring, so survival methods are required
2. Kaplan-Meier Curves: Estimating and Visualizing Survival
The Kaplan-Meier (KM) estimator is the most widely used tool in survival analysis. It estimates the survival function (the probability of surviving past any given time point) directly from observed data, including censored observations.

The math is surprisingly straightforward. At each time point where an event occurs, the KM estimator calculates the conditional probability of surviving past that moment, given that the subject was still at risk. Then it multiplies these conditional probabilities together to get cumulative survival.

Worked example: You follow 10 patients. Events (deaths) happen at months 2, 5, 8, and 12. One patient is censored (lost to follow-up) at month 6. The other five patients are still event-free after month 12.
- At month 2: 10 patients at risk, 1 dies. Survival probability = 9/10 = 0.90.
- At month 5: 9 patients still at risk, 1 dies. Conditional probability = 8/9. Cumulative survival = 0.90 × (8/9) = 0.80.
- At month 8: the patient censored at month 6 drops from the risk set, leaving 7 at risk. One dies. Conditional = 6/7. Cumulative = 0.80 × (6/7) = 0.686.
- At month 12: 6 at risk, 1 dies. Conditional = 5/6. Cumulative = 0.686 × (5/6) = 0.571.

Notice what happened at month 8: the censored patient contributed to the survival estimate up through month 6 (they were in the risk set) but was removed before month 8 because we lost track of them. This is how KM uses partial information without bias. The censored observation isn't thrown away. It shrinks the denominator (risk set) at later time points.

The resulting KM curve is a step function that drops at each event time. The median survival time is where the curve crosses 0.50 on the y-axis: the time by which half the subjects have experienced the event. Confidence intervals (typically based on Greenwood's formula) generally get wider as time progresses because fewer subjects remain at risk.

To compare two KM curves (say, treatment vs. control) you use the log-rank test. It's essentially a chi-squared test that compares the observed number of events in each group to the expected number under the null hypothesis that the curves are identical. A significant log-rank test (p < 0.05) tells you the survival distributions differ, but it doesn't tell you by how much, and it doesn't control for confounders. For that, you need Cox regression.
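The worked example can be reproduced with a few lines of plain Python. This is a sketch, not a replacement for a statistics library. The function names `kaplan_meier` and `median_survival` are hypothetical, and the five patients the example does not describe are assumed here to be censored at month 15 (any censoring time after month 12 gives the same curve).

```python
def kaplan_meier(times, events):
    """Return one row per distinct event time:
    (time, n_at_risk, n_events, survival, greenwood_se)."""
    data = list(zip(times, events))
    survival = 1.0
    greenwood_sum = 0.0
    rows = []
    for t in sorted({time for time, event in data if event}):
        # Risk set: everyone whose follow-up reaches time t.
        at_risk = sum(1 for time, _ in data if time >= t)
        deaths = sum(1 for time, event in data if time == t and event)
        survival *= (at_risk - deaths) / at_risk
        if at_risk > deaths:
            # Greenwood's formula accumulates d / (n * (n - d)).
            greenwood_sum += deaths / (at_risk * (at_risk - deaths))
        rows.append((t, at_risk, deaths, survival, survival * greenwood_sum ** 0.5))
    return rows


def median_survival(rows):
    """First event time at which survival is at or below 0.50, else None."""
    for t, _, _, survival, _ in rows:
        if survival <= 0.5:
            return t
    return None


# Deaths at months 2, 5, 8, 12; one patient censored at month 6.
# ASSUMPTION: the remaining five patients are censored at month 15.
times = [2, 5, 6, 8, 12, 15, 15, 15, 15, 15]
events = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

rows = kaplan_meier(times, events)
for t, at_risk, deaths, survival, se in rows:
    print(f"month {t:>2}: at risk={at_risk:>2}, deaths={deaths}, "
          f"S(t)={survival:.3f}, SE={se:.3f}")
print("median survival:", median_survival(rows))
```

The curve never reaches 0.50 in this data, so `median_survival` returns `None`: the median is not estimable from this follow-up.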
Key Points
- KM survival probability at time t = product of conditional survival probabilities at each prior event time
- Censored subjects contribute to the risk set up to their censoring time, then are removed, so partial information is preserved
- Median survival time is where the KM curve crosses 50%. It may not be estimable if fewer than half the subjects experience the event
- The log-rank test compares two or more KM curves but doesn't adjust for confounders or quantify the difference
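The observed-versus-expected logic of the log-rank test can also be sketched by hand for two groups. The function name `logrank_test` and the twelve-patient dataset below are hypothetical, chosen only to illustrate the calculation. The p-value uses the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_squared, p_value) with 1 df."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]

    observed_a = expected_a = variance = 0.0
    for t in sorted({time for time, event, _ in pooled if event}):
        n = sum(1 for time, _, _ in pooled if time >= t)             # at risk, both groups
        n_a = sum(1 for time, _, g in pooled if time >= t and g == 0)  # at risk, group A
        d = sum(1 for time, e, _ in pooled if time == t and e)       # events, both groups
        d_a = sum(1 for time, e, g in pooled if time == t and e and g == 0)

        observed_a += d_a
        expected_a += d * n_a / n  # expected group A events if the curves are identical
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi_squared = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_squared / 2))  # upper tail of chi-squared, 1 df
    return chi_squared, p_value


# Hypothetical data: months to event; 1 = event observed, 0 = censored.
control_times = [1, 2, 3, 4, 6, 8]
control_events = [1, 1, 1, 1, 1, 1]
treated_times = [5, 7, 9, 10, 11, 12]
treated_events = [1, 1, 0, 1, 0, 1]

chi_squared, p_value = logrank_test(
    control_times, control_events, treated_times, treated_events
)
print(f"chi-squared = {chi_squared:.2f}, p = {p_value:.4f}")
```

The result says whether the curves differ, not by how much, which is the gap Cox regression fills.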
3. Cox Proportional Hazards Regression: The Workhorse Model
The Cox proportional hazards (PH) model is to survival analysis what linear regression is to continuous outcomes: the default, go-to method. It lets you estimate the effect of multiple predictors on survival time simultaneously, while adjusting for confounders.

The model focuses on the hazard function, h(t), which represents the instantaneous rate of the event occurring at time t, given survival up to that point. Think of hazard as "risk per unit time right now." The Cox model specifies: h(t) = h₀(t) × exp(b₁x₁ + b₂x₂ + ...), where h₀(t) is the baseline hazard (left unspecified, which is why Cox regression is called "semi-parametric") and the exponential term captures how covariates shift the hazard up or down.

The key output is the hazard ratio (HR). If the coefficient for a treatment variable is b = -0.30, the hazard ratio is exp(-0.30) = 0.74. Interpretation: the treatment group has 74% of the hazard of the control group at any point in time, or equivalently, a 26% reduction in the instantaneous risk of the event. HR < 1 means lower risk (protective). HR > 1 means higher risk (harmful). HR = 1 means no effect.

Worked example using a realistic study design: A trial compares a new drug vs. placebo for 500 cancer patients. Cox regression output shows:
- Treatment (drug vs placebo): HR = 0.65, 95% CI (0.48, 0.88), p = 0.005
- Age (per year): HR = 1.03, 95% CI (1.01, 1.05), p = 0.002
- Stage (III vs II): HR = 2.15, 95% CI (1.60, 2.89), p < 0.001

Interpretation: After adjusting for age and stage, patients on the drug have 35% lower hazard of death compared to placebo (HR = 0.65). Each additional year of age increases the hazard by 3%, holding treatment and stage fixed. Stage III patients have 2.15 times the hazard of Stage II patients.

The proportional hazards assumption is the model's critical requirement: the ratio of hazards between any two groups must remain constant over time. If the drug works well for the first year but its effect fades, the proportional hazards assumption is violated. You check this with Schoenfeld residual plots (the residuals should show no trend over time) or a formal test based on scaled Schoenfeld residuals. Violations can be handled by stratification, time-varying coefficients, or switching to a parametric model.
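The arithmetic linking coefficients, hazard ratios, confidence intervals, and p-values is short enough to check by hand. The sketch below uses hypothetical helper names (`hazard_ratio`, `wald_p_value`) and assumes the reported interval is a standard Wald interval, symmetric on the log scale with a critical value of 1.96. Under that assumption, the treatment row of the example output can be roughly reconstructed from its HR and CI alone.

```python
import math


def hazard_ratio(coef, se, z_crit=1.96):
    """Convert a Cox coefficient and standard error to (HR, lower, upper)."""
    return (
        math.exp(coef),
        math.exp(coef - z_crit * se),
        math.exp(coef + z_crit * se),
    )


def wald_p_value(coef, se):
    """Two-sided p-value for the Wald statistic z = coef / se."""
    z = coef / se
    return math.erfc(abs(z) / math.sqrt(2))


# A coefficient of -0.30 corresponds to a hazard ratio of about 0.74.
print(round(math.exp(-0.30), 2))  # 0.74

# Treatment row from the example: HR = 0.65, 95% CI (0.48, 0.88).
# ASSUMPTION: the CI is a Wald interval, so its log-scale width is 2 * 1.96 * SE.
reported_hr, reported_lower, reported_upper = 0.65, 0.48, 0.88
coef = math.log(reported_hr)
se = (math.log(reported_upper) - math.log(reported_lower)) / (2 * 1.96)

hr, lower, upper = hazard_ratio(coef, se)
p_value = wald_p_value(coef, se)
print(f"b = {coef:.3f}, SE = {se:.3f}")
print(f"HR = {hr:.2f}, 95% CI ({lower:.2f}, {upper:.2f}), p = {p_value:.3f}")
```

Because the interval is built on the log scale, it is not symmetric around the hazard ratio itself: 0.65 sits closer to 0.48 than to 0.88.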
Key Points
- Cox model: h(t) = h₀(t) × exp(b₁x₁ + ...), semi-parametric because the baseline hazard h₀(t) is left unspecified
- Hazard ratio = exp(coefficient): HR < 1 is protective, HR > 1 is harmful, HR = 1 is no effect
- The proportional hazards assumption requires the hazard ratio between groups to stay constant over time
- Check the PH assumption with Schoenfeld residuals: a trend over time signals a violation
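To make "semi-parametric" concrete, here is a rough sketch of how a single Cox coefficient can be estimated by maximizing the partial likelihood with Newton-Raphson steps. The baseline hazard never appears in the calculation, because it cancels within each risk set. The function names and the small dataset are hypothetical, and the sketch assumes one covariate and no tied event times. Real software handles ties, multiple covariates, and convergence safeguards.

```python
import math


def cox_score_info(beta, times, events, x):
    """Score and information of the Cox partial log-likelihood, one covariate."""
    score = info = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if not e_i:
            continue  # censored subjects only appear inside risk sets
        # Risk set: everyone still under observation at this event time.
        risk = [(x_j, math.exp(beta * x_j)) for t_j, x_j in zip(times, x) if t_j >= t_i]
        total = sum(w for _, w in risk)
        mean_x = sum(x_j * w for x_j, w in risk) / total
        mean_x2 = sum(x_j ** 2 * w for x_j, w in risk) / total
        score += x_i - mean_x           # observed covariate minus weighted average
        info += mean_x2 - mean_x ** 2   # weighted variance within the risk set
    return score, info


def cox_fit_single(times, events, x, max_iter=25, tol=1e-10):
    """Newton-Raphson estimate of one coefficient (assumes no tied event times)."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = cox_score_info(beta, times, events, x)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical data: six control patients (x = 0), six treated patients (x = 1).
times = [1, 2, 3, 4, 6, 8, 5, 7, 9, 10, 11, 12]
events = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
treated = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

beta_hat = cox_fit_single(times, events, treated)
print(f"b = {beta_hat:.3f}, HR = {math.exp(beta_hat):.3f}")
```

A negative coefficient here means the treated group has the lower hazard, matching the HR < 1 reading above.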
4. Practical Pitfalls and When to Use What
After learning the mechanics, the question is: which tool do you pull out, and when? Here's a decision framework.

Use Kaplan-Meier when you want to visualize survival over time, estimate median survival, or compare two groups without adjusting for confounders. KM is descriptive: it shows you what happened. Most published survival studies lead with a KM plot because it's intuitive and immediately communicates the pattern.

Use Cox regression when you need to adjust for confounders, assess the effect of continuous predictors, or estimate hazard ratios with confidence intervals. Cox regression is inferential: it tells you which variables matter after accounting for others.

Use the log-rank test to formally test whether two KM curves differ. It's the hypothesis test companion to the KM plot. But if you have more than one or two covariates, move to Cox regression.

Common pitfalls worth flagging:

First, immortal time bias: if the treatment group includes time before treatment started (e.g., patients who survived long enough to receive a transplant are credited with the pre-transplant survival time), the treatment group looks artificially better. The fix is to start the clock at the relevant exposure time, or use time-varying covariates.

Second, competing risks: in a cancer study, if a patient dies of a heart attack, that's not the event of interest (cancer death), but it prevents the cancer death from ever being observed. Standard KM and Cox analyses that treat the heart-attack death as a censored observation assume that censoring is independent of the event, and competing risks violate this. The solution is the Fine-Gray competing risks model or cause-specific hazard models.

Third, informative censoring: if patients drop out because they're getting sicker (not random dropout), the standard assumption that censoring is non-informative breaks down. Sensitivity analysis and joint modeling of the event process and dropout process can help.

StatsIQ covers survival analysis concepts with practice problems that walk you through interpreting KM curves, calculating hazard ratios, and spotting violations of the proportional hazards assumption: the kinds of questions that show up on advanced statistics exams.
Key Points
- Use KM for visualization and descriptive estimation; use Cox regression for multivariable adjustment and hazard ratios
- Immortal time bias inflates treatment benefit. Start the clock at exposure time, not study entry
- Competing risks violate the independence-of-censoring assumption and require specialized models (Fine-Gray)
- Always check for informative censoring: dropout that depends on health status biases standard survival estimates
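One common way to apply the time-varying fix for immortal time bias is to split each subject's follow-up at the moment exposure begins, so that pre-treatment time is counted as unexposed. The sketch below shows that bookkeeping only. The function name `split_at_exposure`, the row layout (start, stop, event, exposed), and the three subjects are hypothetical.

```python
def split_at_exposure(follow_up, event, exposure_time=None):
    """Split one subject's follow-up into (start, stop, event, exposed) rows.

    Time before exposure_time is unexposed; time after it is exposed.
    The event, if any, is attached to the interval in which follow-up ends.
    """
    if exposure_time is None or exposure_time >= follow_up:
        return [(0.0, follow_up, event, 0)]
    rows = []
    if exposure_time > 0:
        rows.append((0.0, exposure_time, 0, 0))  # waiting time: no event possible here
    rows.append((exposure_time, follow_up, event, 1))
    return rows


# Hypothetical subjects: (months of follow-up, event indicator, month of transplant).
subjects = [(30, 1, 12), (8, 1, None), (24, 0, 6)]

rows = [row for subject in subjects for row in split_at_exposure(*subject)]
for row in rows:
    print(row)

exposed_time = sum(stop - start for start, stop, _, exposed in rows if exposed)
unexposed_time = sum(stop - start for start, stop, _, exposed in rows if not exposed)

# The biased version credits transplant patients with their whole follow-up.
naive_exposed_time = sum(f for f, _, tx in subjects if tx is not None)

print("exposed person-months (split):", exposed_time)
print("exposed person-months (naive):", naive_exposed_time)
print("unexposed person-months (split):", unexposed_time)
```

Rows in this start-stop layout are the usual input for Cox models with time-varying covariates, though the exact format expected depends on the software you use.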
Key Takeaways
- The Kaplan-Meier estimator is non-parametric and makes no assumption about the shape of the survival distribution
- The log-rank test is most powerful when the proportional hazards assumption holds. It has low power against crossing survival curves
- A hazard ratio of 0.50 means a 50% reduction in the instantaneous risk of the event at every time point, provided the proportional hazards assumption holds
- The Cox model is semi-parametric: it leaves the baseline hazard unspecified and estimates only the hazard ratios
- Median survival from KM may not be estimable if fewer than 50% of subjects experience the event during follow-up
- Right-censoring is the most common type, but left-censoring (event occurred before observation began) and interval censoring (event occurred between two observation times) also exist
Practice Questions
1. In a study of 100 patients, 60 died during follow-up and 40 were still alive at the end of the study. A colleague calculates the mortality rate as 60/100 = 60%. What is wrong with this approach?
2. A Cox regression for time to hospital readmission shows: Age (per year) HR = 1.02, 95% CI (1.01, 1.04); Diabetes (yes vs no) HR = 1.85, 95% CI (1.20, 2.85); Treatment (new vs standard) HR = 0.60, 95% CI (0.42, 0.86). Interpret the treatment hazard ratio.
3. A Kaplan-Meier curve for Group A crosses above the curve for Group B at around month 18. The log-rank test gives p = 0.35. What should you conclude?
FAQs
Common questions about this topic
Survival analysis is not limited to death or medical outcomes. Despite the name, it applies to any time-to-event outcome: time to machine failure, time to customer churn, time to disease recurrence, time to first job after graduation, time to loan default. The math is identical regardless of the event. What matters is that you are measuring duration until something happens and that some observations may be censored.
When the proportional hazards assumption is violated, the hazard ratios become averages over time rather than constant effects, which can be misleading. If the treatment helps initially but the benefit fades, a single HR underestimates the early benefit and overestimates the late benefit. Solutions include stratified Cox models (allowing different baseline hazards per group), adding time-covariate interactions, or using alternative models like the accelerated failure time model.
StatsIQ includes problems on interpreting Kaplan-Meier curves, calculating and interpreting hazard ratios from Cox regression output, checking the proportional hazards assumption, and identifying common pitfalls like immortal time bias and competing risks.