Earlier on, we had little automation around the maintenance of our Airflow cluster. Those of you familiar with Airflow may be aware of the bloat that accumulates over time from DAG runs. Metadata for each DAG run and its tasks is recorded in the corresponding tables, and log files are written for every task.
We currently write log files to local disk (as opposed to cloud storage). For our use case, 30 days' worth of metadata retention was sufficient. To delete database table entries and log files older than 30 days, we leveraged the maintenance DAGs shared by the team at Clairvoyant.
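The log side of that cleanup boils down to walking the log directory and removing files older than the retention window. Here is a minimal sketch of that idea (not the Clairvoyant DAG itself; the folder path and retention constant are placeholders for your own configuration):

```python
import os
import time

MAX_LOG_AGE_DAYS = 30  # our retention window; adjust to taste

def purge_old_logs(base_log_folder: str, max_age_days: int = MAX_LOG_AGE_DAYS) -> int:
    """Delete log files older than max_age_days; return how many were removed."""
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(base_log_folder):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Compare the file's modification time against the cutoff.
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    return removed
```

In practice this logic runs on a schedule, as a task inside a maintenance DAG, rather than being invoked by hand.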
At the time of this writing, we have almost 1,000 DAGs. As you might expect, table cleanup and log cleanup are two of our most important maintenance tasks. Specifically, we relied on the db-cleanup and log-cleanup DAGs to keep database and server disk usage in check.
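The database side of the cleanup follows the same pattern: delete rows whose timestamp falls outside the retention window. The sketch below illustrates the idea generically against SQLite; the `dag_run` table and `execution_date` column echo Airflow's metadata schema, but the real db-cleanup DAG covers several tables and talks to the metadata database through SQLAlchemy:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def cleanup_table(conn: sqlite3.Connection, table: str, ts_column: str,
                  max_age_days: int = 30) -> int:
    """Delete rows older than the retention window; return rows removed."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=max_age_days)).isoformat()
    # table/ts_column are interpolated directly, so they must come from
    # trusted code, never from user input.
    cur = conn.execute(f"DELETE FROM {table} WHERE {ts_column} < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Run once per table (DAG runs, task instances, logs, and so on), this keeps the metadata database from growing without bound.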
With the upgrade to Airflow 2.0, the immense performance improvements to the scheduler allowed us to increase the frequency of many of our DAGs. We talk more about the major gains we achieved, along with our learnings, here. One third of our DAGs now run every five minutes, and a sizable chunk execute every minute! This quickly added up to over 250,000 database table entries and log files per day.
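A rough back-of-the-envelope check shows how the schedule frequencies alone drive volume into that range. Only the ~1,000 total and the frequencies above are from our setup; the size of the per-minute "chunk" and the one-record-per-run simplification are illustrative assumptions:

```python
five_minute_dags = 1000 // 3      # "one third" of ~1,000 DAGs
runs_per_day_5min = 24 * 60 // 5  # 288 runs/day on a 5-minute schedule
per_minute_dags = 100             # assumed size of the per-minute chunk
runs_per_day_1min = 24 * 60       # 1,440 runs/day on a 1-minute schedule

daily_runs = (five_minute_dags * runs_per_day_5min
              + per_minute_dags * runs_per_day_1min)
# Every run writes at least one metadata row and one log file per task,
# so even at a single task per run this lands near the observed volume.
print(daily_runs)  # → 239904
```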