Please also see the Python version of this article. To skip straight to the code, please check out The Crosstab Kite's gists repo. Analytics tables an

How to compute Kaplan-Meier survival curves in SQL

submited by
Style Pass
2021-07-20 22:30:05

Please also see the Python version of this article. To skip straight to the code, please check out The Crosstab Kite's gists repo.

Analytics tables and dashboards rarely answer these questions directly and correctly. Why? Because the survival analysis methods that best answer these questions are implemented in R and Python packages that can't be incorporated into SQL-based data pipelines and dashboards.

In this article, I show how to compute survival curves and cumulative hazard curves directly in SQL (Postgres, specifically). Please copy and use the code in this post—let me know if you find it useful! You can also find the code with a permissive license in The Crosstab Kite gists repo.

This post assumes you know what survival and cumulative hazard curves are and why you would want to estimate them. For an intro to these concepts, I recommend the documentation for the Lifelines and Scikit-survival packages, and you should also check out my other articles on survival analysis.

I'll use the Retail Rocket dataset, which I downloaded from Kaggle. This dataset is an event log, where each row represents an interaction between a Retail Rocket user (visitorid) and a product (itemid) at a moment in time (event_at). Here's a sample:

Leave a Comment