🤔 Problem

After an unexpected crash or sudden stop of one of the Postgres containers, the database may no longer be able to locate a valid checkpoint record.

The following log line appears in the affected Postgres container:

PANIC:  could not locate a valid checkpoint record
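
To confirm that a crashing container is hitting this exact error, its logs can be filtered for the message. The following is a minimal sketch, assuming a Kubernetes cluster reachable with kubectl; the function name has_checkpoint_panic and the pod and container names in the usage comment are hypothetical.

```shell
# Return 0 if the log text on stdin contains the checkpoint PANIC, 1 otherwise.
has_checkpoint_panic() {
  grep -q 'PANIC: *could not locate a valid checkpoint record'
}

# Example usage (pod and container names are placeholders):
#   kubectl logs my-postgres-pod -c postgres --previous | has_checkpoint_panic \
#     && echo "checkpoint record is missing or corrupted"
```

The --previous flag reads the logs of the last terminated instance of the container, which is usually where the PANIC line ends up when the container keeps restarting.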

Restarting the container does not resolve the issue on its own, because Postgres keeps looking for a checkpoint record that is probably corrupted.

🌱 Solution

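The fix is to reset the write-ahead log (WAL) and other control information of the PostgreSQL database cluster with pg_resetwal. Data already written to disk is normally preserved, but transactions that were in flight at the time of the crash may be lost or only partially applied, so treat this as a last resort.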
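
Before resetting anything, it can help to preview what pg_resetwal would do. The sketch below relies on the -n option of pg_resetwal, which prints the values it would write and exits without modifying the cluster. The function name preview_wal_reset is hypothetical, and the default data directory is the one used later in this article.

```shell
# Print the control values pg_resetwal would write, without changing anything.
# Must be run as the postgres user; the data directory defaults to the path
# used in this article and can be overridden with the first argument.
preview_wal_reset() {
  pg_resetwal -n "${1:-/var/lib/postgresql/data}"
}

# Example usage, from a shell inside the container:
#   preview_wal_reset
#   preview_wal_reset /some/other/datadir
```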

Proceed with the following steps in the deployment.yaml of the affected deployment:

  1. Locate the faulty Postgres container.

  2. Add the following fields to that container:

    1. command: ["sleep"]

    2. args: ["1000"]

  3. Save and redeploy. The container now runs sleep for 1000 seconds instead of starting Postgres, so it stays up rather than failing and restarting immediately.

  4. Open a bash shell in the pod.

  5. Run:

    su postgres
    pg_resetwal /var/lib/postgresql/data

  6. Once pg_resetwal has completed successfully, revert the changes from step 2 and redeploy so that Postgres starts normally.
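
The steps above can also be driven from the command line. This is a rough sketch rather than a definitive procedure, assuming a Kubernetes cluster managed with kubectl and a container image that provides su. The variable defaults (my-postgres-pod, postgres) and the function names redeploy and reset_wal are hypothetical and need to be replaced with the real values from your cluster.

```shell
# Placeholder values; override them via the environment before use.
DEPLOYMENT_FILE="${DEPLOYMENT_FILE:-deployment.yaml}"
POD="${POD:-my-postgres-pod}"
CONTAINER="${CONTAINER:-postgres}"
PGDATA_DIR="${PGDATA_DIR:-/var/lib/postgresql/data}"

# Steps 3 and 6: apply the deployment file after editing it
# (first with the sleep override added, later with it removed).
redeploy() {
  kubectl apply -f "$DEPLOYMENT_FILE"
}

# Steps 4 and 5: run pg_resetwal as the postgres user inside the
# sleeping container, without opening an interactive shell.
reset_wal() {
  kubectl exec "$POD" -c "$CONTAINER" -- \
    su postgres -c "pg_resetwal $PGDATA_DIR"
}

# Example usage:
#   redeploy      # after adding command/args to deployment.yaml
#   reset_wal     # while the container is sleeping
#   redeploy      # after removing command/args again
```

The pod name usually changes after each redeploy, so look it up again (for example with kubectl get pods) before calling reset_wal.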
