Troubleshooting Postgres CrashLoopBackOff

🤔 Problem

It can happen after an unexpected crash or sudden stop of one of the Postgres containers that the database can no longer locate a valid checkpoint.

The following log can be observed in the concerned Postgres container

CODE

PANIC:  could not locate a valid checkpoint record

Restarting the container doesn’t seem to solve automatically the issue as Postgres is looking for a checkpoint record that is probably corrupted.

We would like to reset the write-ahead log and other control information of a PostgreSQL database cluster. The stored data should not be affected.

Proceed with the following steps in the concerned deployment.yaml

locate the faulty postgres container
add the fields
1. command: ["sleep"]
2. args: ["1000"]
Save & Re deploy (The faulty container should not immediately restart when it fails)
Open a bash in the pod

run

CODE

su postgres
pg_resetwal /var/lib/postgresql/data