Troubleshooting Postgres CrashLoopBackOff
🤔 Problem
It can happen after an unexpected crash or sudden stop of one of the Postgres containers that the database can no longer locate a valid checkpoint.
The following log can be observed in the concerned Postgres container
PANIC: could not locate a valid checkpoint record
Restarting the container doesn’t seem to solve automatically the issue as Postgres is looking for a checkpoint record that is probably corrupted.
🌱 Solution
We would like to reset the write-ahead log and other control information of a PostgreSQL database cluster. The stored data should not be affected.
Proceed with the following steps in the concerned deployment.yaml
locate the faulty
postgres
containeradd the fields
command: ["sleep"]
args: ["1000"]
Save & Re deploy (The faulty container should not immediately restart when it fails)
Open a bash in the pod
run
su postgres pg_resetwal /var/lib/postgresql/data
CODEOnce the database accessible, revert the changes from the steps
2.
📎 Related articles
https://stackoverflow.com/questions/60604699/postgres-k8s-panic-could-not-locate-a-valid-checkpoint-record-crashloopbachttps://www.postgresql.r2schools.com/postgresql-panic-could-not-locate-a-valid-checkpoint-record/