Errors

Timeline | server's history | checkpoint error

Symptoms:

The operator tries to create a replica, but the new instance fails with this error:

...
requested timeline XXX is not a child of this server's history
Latest checkpoint is at YYY on timeline XXX, but in the history of the requested timeline, the server forked off from that timeline at YYY

Destroying the failing replicas and forcing the operator to create new ones does not solve the problem.

Cause:

The timeline history of the primary instance does not match the history stored in the backups.

Workaround:

  • Take a manual backup without using the operator (for example, with pg_dump)
  • Scale the cluster down to 1 instance. You will probably need to destroy the failing instances manually
  • Change where the backups are stored: in the backup section, select another bucket or change the serverName. The goal is to start with an empty backup folder.
  • Take a backup using the operator and check that it works.
  • Also check that the WAL files are being written to the barmanObjectStore
  • Scale the instances back to the desired number

I have tried this with a barmanObjectStore-based backup.
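Changing the serverName works because barman-cloud keeps each server's data in its own folder. The sketch below assumes the usual barman-cloud layout (<destinationPath>/<serverName>/base/ for base backups and <destinationPath>/<serverName>/wals/ for WAL files) and that serverName defaults to the cluster name. The helper names and the "-vN" suffix are hypothetical, and the set of existing folders would have to come from a listing of the bucket.

```python
from typing import Dict, Set


def barman_prefixes(destination_path: str, server_name: str) -> Dict[str, str]:
    """Prefixes barman-cloud uses for one server (assumed standard layout)."""
    root = destination_path.rstrip("/") + "/" + server_name
    return {"base": root + "/base/", "wals": root + "/wals/"}


def fresh_server_name(cluster_name: str, existing: Set[str]) -> str:
    """Return a serverName whose folder is not in `existing`.

    Hypothetical naming scheme: the cluster name first (the default
    serverName), then <cluster>-v2, <cluster>-v3, ...
    """
    if cluster_name not in existing:
        return cluster_name
    n = 2
    while f"{cluster_name}-v{n}" in existing:
        n += 1
    return f"{cluster_name}-v{n}"


# Example: the old folder "cluster-example" still holds the conflicting history.
name = fresh_server_name("cluster-example", {"cluster-example"})
print(name)  # -> cluster-example-v2
print(barman_prefixes("s3://BUCKETNAME/", name)["wals"])  # -> s3://BUCKETNAME/cluster-example-v2/wals/
```

The new name then goes into the serverName field of the backup section, so the first backup and WAL archive start in an empty folder.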

"WAL file not found in the recovery object store"

  • Check the logs of every CNPG instance
  • Note the name of the failing .history file and compare it with the .history files in the /var/lib/postgresql/wal/pg_wal/ directory of the instances
  • Make sure the IRSA (IAM Roles for Service Accounts) permissions are correct
  • Try recreating the instances with the cnpg plugin's destroy and promote commands
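When comparing .history files it helps to know their format: each line holds the parent timeline, the switch point LSN and a reason, separated by whitespace, and the file for timeline N is named with N as 8 hex digits (timeline 3 gives 00000003.history). The sketch below parses that format and reproduces, in simplified form, the check behind the "is not a child of this server's history" error: an instance's latest checkpoint must lie on its timeline strictly before the point where that timeline was left in the requested timeline's history. All function names are hypothetical; this is not PostgreSQL's own code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HistoryEntry:
    parent_tli: int   # timeline that ended
    switchpoint: int  # LSN (as an integer) where it ended and the next one began
    reason: str


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/3000000' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def history_file_name(tli: int) -> str:
    """Timeline 3 -> '00000003.history'."""
    return f"{tli:08X}.history"


def parse_history(text: str) -> List[HistoryEntry]:
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are skipped
        fields = line.split(None, 2)
        reason = fields[2] if len(fields) > 2 else ""
        entries.append(HistoryEntry(int(fields[0]), parse_lsn(fields[1]), reason))
    return entries


def fork_point(entries: List[HistoryEntry], tli: int) -> Optional[int]:
    """LSN where `tli` was left in this history, or None if it is not listed."""
    for entry in entries:
        if entry.parent_tli == tli:
            return entry.switchpoint
    return None


def is_child(entries: List[HistoryEntry], requested_tli: int,
             checkpoint_tli: int, checkpoint_lsn: int) -> bool:
    """Simplified version of the check behind 'is not a child of this server's history'."""
    if checkpoint_tli == requested_tli:
        return True
    switch = fork_point(entries, checkpoint_tli)
    # The checkpoint must come strictly before the point where its timeline was left.
    return switch is not None and checkpoint_lsn < switch


# Contents of a hypothetical 00000003.history file.
history = "1\t0/3000000\tno recovery target specified\n2\t0/5000000\tno recovery target specified\n"
entries = parse_history(history)
print(history_file_name(3))  # -> 00000003.history
print(is_child(entries, 3, 2, parse_lsn("0/6000000")))  # checkpoint after the fork -> False
```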

The token included in the request has no service account role association for it

ERROR: Barman cloud backup delete exception: Error when retrieving credentials from container-role: Error retrieving metadata: Received non 200 response 404 from container metadata:
...
(ResourceNotFoundException): The token included in the request has no service account role association for it., fault: client\n\n","error":"exit status 4"

This can happen after changes to the IRSA authentication setup (IAM role, service account annotation, ...). To solve it, restart the cluster, for example with kubectl cnpg restart <cluster-name>.
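The 404 "no service account role association" error usually means the pod's service account token is not linked to an IAM role. On EKS, IRSA makes that link through the eks.amazonaws.com/role-arn annotation on the Kubernetes ServiceAccount (in CNPG it can be set through spec.serviceAccountTemplate), and the credentials are injected when the pod is created, which is why a restart is needed after the role or annotation changes. A minimal sketch that checks a ServiceAccount manifest, for example the output of kubectl get serviceaccount <cluster-name> -o json parsed into a dict; the function name and the expected-ARN parameter are hypothetical.

```python
import re
from typing import List, Optional

IRSA_ANNOTATION = "eks.amazonaws.com/role-arn"
# arn:<partition>:iam::<12-digit account>:role/<optional path/><name>
ROLE_ARN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def irsa_problems(service_account: dict, expected_role_arn: Optional[str] = None) -> List[str]:
    """List IRSA-related problems found in a ServiceAccount manifest (a parsed dict)."""
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    arn = annotations.get(IRSA_ANNOTATION)
    if arn is None:
        return [f"missing annotation {IRSA_ANNOTATION}"]
    problems = []
    if not ROLE_ARN.match(arn):
        problems.append(f"malformed role ARN: {arn}")
    if expected_role_arn is not None and arn != expected_role_arn:
        problems.append(f"annotation points to {arn}, expected {expected_role_arn}")
    return problems


# Hypothetical account ID and role name.
sa = {"metadata": {"name": "cluster-example",
                   "annotations": {IRSA_ANNOTATION: "arn:aws:iam::123456789012:role/cnpg-backup"}}}
print(irsa_problems(sa))  # -> []
```

An empty list only means the annotation looks right; the IAM role's trust policy must still allow that service account.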

Error calling the HeadBucket operation

ERROR: Barman cloud WAL archiver exception: An error occurred (403) when calling the HeadBucket operation: Forbidden

This is an AWS IAM permissions issue. You probably need to grant the "s3:ListBucket" action on the bucket itself (the bucket ARN, not the objects inside it):

{
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::BUCKETNAME"
        }
    ],
    "Version": "2012-10-17"
}
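HeadBucket is authorized by the s3:ListBucket action, and that action has to be granted on the bucket ARN (arn:aws:s3:::BUCKETNAME). Object-level actions such as s3:PutObject and s3:GetObject apply to arn:aws:s3:::BUCKETNAME/* instead, so granting ListBucket only on the /* resource is a common cause of this 403. The sketch below checks a policy document for that case. It is deliberately simplified and hypothetical: it only knows the "aws" partition and ignores Deny statements, conditions, NotAction and wildcard resources other than "*".

```python
from typing import Any, List


def _as_list(value: Any) -> List[Any]:
    """IAM allows either a single value or a list for Statement, Action and Resource."""
    return value if isinstance(value, list) else [value]


def allows_list_bucket(policy: dict, bucket: str) -> bool:
    """Simplified check: does an Allow statement grant s3:ListBucket on the bucket ARN?"""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        # Action names are case-insensitive in IAM.
        actions = [a.lower() for a in _as_list(statement.get("Action", []))]
        resources = _as_list(statement.get("Resource", []))
        if any(a in ("s3:listbucket", "s3:*", "*") for a in actions) and (
            bucket_arn in resources or "*" in resources
        ):
            return True
    return False


# The policy shown above, as a Python dict.
policy = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": "arn:aws:s3:::BUCKETNAME"}
    ],
    "Version": "2012-10-17",
}
print(allows_list_bucket(policy, "BUCKETNAME"))  # -> True
```

A policy saved to a file can be checked the same way after loading it with json.load.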

"HTTP communication issue" error

Restart the controller (the CloudNativePG operator pod).

A replica cannot be created

If we get errors like:

"requested timeline XXX is not a child of this server's history"
"Latest checkpoint is at XXX on timeline XXX, but in the history of the requested timeline, the server forked off from that timeline at YYY."

and only the primary is up, we can:

  • Take a manual backup of every database with pg_dump
  • Scale the cluster down to 1 instance (the primary) and remove the backup section
  • Rename the S3 folder, or use a different serverName in the backup section
  • Re-enable the backup section and take a backup with the kubectl cnpg plugin (kubectl cnpg backup <cluster-name>)
  • If it works, increase the number of instances to 3
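For the first step, one pg_dump per database gives a logical backup that does not depend on the object store. The sketch below only builds the command lines. It assumes the cluster's read-write service is reachable as <cluster-name>-rw, that the password comes from PGPASSWORD or a .pgpass file, that the chosen user can read every database (superuser access may be disabled on CNPG clusters), and that the database list comes from SELECT datname FROM pg_database WHERE NOT datistemplate. The function name is hypothetical. Per-database dumps do not include roles; pg_dumpall --globals-only covers those.

```python
from typing import List


def pg_dump_commands(databases: List[str], host: str, user: str = "postgres") -> List[List[str]]:
    """Build one pg_dump command per database (custom format, one file each)."""
    commands = []
    for db in databases:
        if db in ("template0", "template1"):
            continue  # template databases do not need a dump
        commands.append(["pg_dump", "-h", host, "-U", user, "-Fc", "-d", db, "-f", f"{db}.dump"])
    return commands


# Hypothetical cluster name "cluster-example".
for cmd in pg_dump_commands(["app", "postgres", "template1"], "cluster-example-rw"):
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

The custom-format files (-Fc) can later be restored with pg_restore.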

Using the NFS CSI driver

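  • You will probably need to grant more permissions on the mounted directory
  • The spec.postgresUID and spec.postgresGID fields of the Cluster resource (both default to 26) let you adapt the UID/GID used by PostgreSQL, for example to match the owner of the NFS share
  • Don't use the subDir parameter in the StorageClass
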
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-postgre
parameters:
  mountPermissions: "0777"
...

This avoids some permission errors, as well as others like:

"controller with name instance-cluster already exists. Controller names must be unique to avoid multiple controllers reporting to the same metric"
"stale NFS file handle"
This is an old primary instance in a new cluster without backup
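For the NFS permission errors above, what matters is whether the PostgreSQL user (UID/GID 26 by default, or the values of spec.postgresUID and spec.postgresGID) can create files in the mounted directory. Setting mountPermissions to "0777" makes the directory writable for "other", so ownership no longer matters. A simplified sketch of the POSIX check, meant to be run from a pod that has the volume mounted; it ignores root, supplementary groups, ACLs and server-side squashing (root_squash/all_squash), and the function names are hypothetical.

```python
import os
import stat


def can_write_dir(mode: int, owner_uid: int, owner_gid: int, uid: int = 26, gid: int = 26) -> bool:
    """Simplified POSIX check: can uid/gid create files in a directory with this mode?

    Only one permission class applies: owner, else group, else other.
    Creating files needs both write and execute (search) permission.
    """
    if uid == owner_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif gid == owner_gid:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (mode & needed) == needed


def check_mount(path: str, uid: int = 26, gid: int = 26) -> bool:
    """Stat the mounted directory at `path` and apply the check above."""
    st = os.stat(path)
    return can_write_dir(st.st_mode, st.st_uid, st.st_gid, uid, gid)


print(can_write_dir(0o40777, 0, 0))  # mountPermissions "0777", owned by root -> True
print(can_write_dir(0o40755, 0, 0))  # owned by root, 0755 -> False
```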