Case study: optimization of weirdly picked bad plan

submited by
Style Pass
2024-10-29 08:30:05

We recently hit an interesting case where planner picked wrong plan. Kinda. And figured it could be used to show how to deal with such cases.

So, we have some databases on PostgreSQL 14 (yes, I know, we should upgrade, but it is LONG project to upgrade, so it's not really on the table now).

Last week someone noticed problems with website. These were tracked to queries getting killed because of statement_timeout of 3 minutes.

Interesting thing was that on primary db the query returned in couple of milliseconds, but on replica it got killed after 3 minutes. Repeatably.

So, it's using an index. What can I see about it, and the table itself? Well, the table is rather large: 232GB of data in table, additional 2GB in toast. And ~ 380GB in indexes. Many indexes.

On primary, where everything was fast, Pg decided to use index to filter based on mutilation_id and workflow_state, and then it got max value of cached_due_date by simply reading the value from all rows that matched mutilation_id/workflow_state combo.

Leave a Comment